BCHM 4400 - Exam 1
in UCSC Genome Browser, what button can be used to retrieve annotation data in tab-delimited text format?
"Table Browser" under "Tools"
describe Entrez search statement syntax with common tags
"search term (2+words)" [tag] "search term" [tag]... *tags*: - [ACCN]: Accession - [AUTH]: Author name - [JOUR]: Journal title, official abbreviation, or ISSN # - [ORGN]: Organism, scientific or common name - [PT]: Publication type (Review, Clinical Trial, Letter, etc) - [SUBS]: Chemical substance name - [SYM]: Gene name (symbol) - [UID]: Unique identifier **truncate search terms with (*)*
what are the functional signals for exon prediction?
5′ splice site (donor), 3′ splice site (acceptor), translational start site, stop codon, etc.
repetitive sequences comprise _______ of the genome; what are the 2 main types?
>50% - tandem repeats and interspersed repeats
how much of the human genome is transcribed?
>80% - >30,000 long non-coding RNAs (lncRNAs) - non-coding RNAs (ncRNAs) of >200 nucleotides - key roles in gene regulation, development and disease - other ncRNAs (rRNAs, tRNAs, miRNAs, piRNAs, etc)
global vs. local alignment algorithms
*global alignment algorithms*: compare two sequences along their entire length, and are most applicable to highly similar sequences *local alignment algorithms*: find the most similar regions in two sequences, and are best for sequences that share some degree of similarity or for sequences of different lengths
describe an instance vs. a label in machine learning
*instance*: observations of a specific subject *label*: a response variable applied to all subjects
what is AUGUSTUS?
*intrinsic* program based on a generalized HMM with a new method for modeling intron length distributions - has a flexible mechanism for incorporating extrinsic information, such as EST and protein alignments - **one of the most accurate programs for ab initio gene prediction
describe prokaryotic genome size, coding capacity, and gene organization
*size*: less DNA and fewer genes than eukaryotes - ex: E. coli has 4,277 protein-coding genes *coding capacity*: compact and continuous genes *gene organization*: operons - prokaryotes often contain plasmids, which are usually small and circular DNA with additional genes (e.g., antibiotic resistance)
describe eukaryotic genome size, coding capacity, and coding continuity
*size*: substantially larger with more complex organization than prokaryotic genomes *coding capacity*: enormous protein-coding capacity, but the majority of DNA does not code for proteins *coding continuity*: protein-coding sequences (exons) can be interrupted by noncoding introns, which are removed by splicing from the primary RNA transcript - alternative splicing allows for various combinations of exons to be joined to form different mRNAs, which produce more than one polypeptide from a gene
compare static vs. dynamic web pages
*static* 1. web browser sends a request for a pre-existing HTML file to the web server 2. the web server returns the contents of the file *dynamic* 1. web browser requests service with parameters from the web server 2. the web server runs a program using the parameters through a CGI program 3. the CGI program returns the program output (HTML) to the web server 4. the web server returns the newly-created HTML file to the web browser
what are the 3 tiers for architecture of a DB system?
*top*: interface tier - web interface - client-side programs *middle*: application tier - application logic - DB connection - server-side programs *bottom*: DB tier - data storage - handling inquiries
how do we quantify nucleotide similarity?
+1 for a match, -1 for a mismatch
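A minimal sketch of this scoring scheme, assuming two ungapped sequences of equal length. The function name `identity_score` and the example sequences are illustrative only.

```python
def identity_score(seq_a, seq_b, match=1, mismatch=-1):
    """Score two equal-length, ungapped sequences position by position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    # +1 for every identical position, -1 for every differing position
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


score = identity_score("GATTACA", "GACTATA")  # 5 matches, 2 mismatches
```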
what are Boolean operators?
- *AND*: both search terms present in a PubMed entry - *OR*: either one or both of the terms to be present - *NOT*: excluding entries with the search term
what are the 5 main BLAST programs?
- *BLASTP*: protein → protein - *BLASTN*: DNA → DNA - *BLASTX*: DNA → protein - *TBLASTN*: protein → DNA (- *TBLASTX*: DNA(6) → DNA(6))
list the 4 software packages included in GMOD
- *Chado*: a set of database schema modules for building a model organism relational database - *GBrowse/JBrowse*: Generic Genome Browser - *Apollo*: genome sequence annotation editor - *Pathway Tools*: for pathway analysis of genomes
list some DBs about biological macromolecules (6)
- *GenBank*: nucleotide sequences - *UniProtKB*: protein sequences - *GEO*: gene expression profiles - *PDB*: protein and nucleotide structures - *KEGG*: biological networks and pathways - *PubMed*: publications
what are 3 ways you can locate a computer on the internet?
- *IP address*: each computer's is unique - *domain name*: human-friendly name associated with an IP address - *URL*: uniform resource locator (link)
list the 6 RefSeq accession number prefixes
- *NG_*: genomic sequence - *NM_*: mRNA - *NP_*: protein - *NR_*: non-coding RNA - *XM_*: model mRNA - *XP_*: model protein
what are the 2 main scoring schemes used for protein sequences?
- PAM (point accepted mutation; first developed) - BLOSUM (blocks substitution matrix)
list the resources presented on the UniProt homepage
- Swiss-Prot - TrEMBL - UniRef - UniParc - Proteomes - other supporting data
describe many-to-many relationships between genes and GO terms
- a gene can be associated with one or more GO terms (gene categories) - one category normally has many genes
describe global alignment algorithms
- based on *dynamic programming* (mathematical technique) - guaranteed to give an optimal global alignment - time required to align sequences of lengths x and y is proportional to x × y - too slow to search a large sequence DB for a match to a query sequence
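The dynamic programming idea can be sketched as follows. This is a simplified Needleman-Wunsch-style scorer, assuming +1 match, -1 mismatch, and a linear gap penalty of -1 (the gap penalty is an assumption, not from the notes). It returns only the optimal score, not the alignment itself. The nested loops fill one cell per pair of positions, which is why the time grows with the product of the two lengths.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of x and y by dynamic programming."""
    rows, cols = len(x) + 1, len(y) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: aligning a prefix against nothing costs only gaps
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # Best of: align the two residues, gap in y, gap in x
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]
```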
describe BLOSUM substitution matrices
- based on substitution patterns in conserved motifs from the BLOCKS database - to avoid overweighting closely related sequences, groups of proteins with sequence identities higher than a threshold are replaced by either a single representative or a weighted average (62% threshold = BLOSUM62) - supersede PAM matrices
what are the composition patterns of protein-coding regions?
- codon usage preferences (out of synonymous codon options) - hexanucleotide distributions - frequencies of the pair A...A with k nucleotides apart
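A rough sketch of how the first two composition patterns could be tallied, assuming an in-frame coding sequence as input. The helper names `codon_counts` and `kmer_counts` and the toy sequence are hypothetical.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons; trailing bases that do not fill a codon are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def kmer_counts(seq, k=6):
    """Count overlapping k-mers (k=6 gives hexanucleotide counts)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


codons = codon_counts("ATGGCTGCTGCCTAA")  # ATG GCT GCT GCC TAA
```

Here GCT and GCC are synonymous codons for alanine, so their relative counts are the kind of preference a gene finder can exploit.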
describe the columns of BED File Format
- columns 1-3: chromosome, start position, and end position - column 4: name (e.g., RefSeq identifier) - column 5: score (0-1000, increasing shades of gray) - column 6: strand (DNA strand, either '+' or '-') - columns 7-8: thickStart and thickEnd (start and end positions of thick lines) - column 9: itemRgb (RGB value to specify the item color) - column 10-12: blockCount, blockSizes, and blockStarts, displaying the number of blocks (e.g., exons) in each row, the block sizes, and the block start positions
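A minimal sketch of reading the first six BED columns, assuming tab-separated fields. The `BedRecord` type, the `parse_bed_line` helper, and the example line are illustrative. BED start positions are 0-based and end positions are exclusive, so the feature length is simply end minus start.

```python
from collections import namedtuple

BedRecord = namedtuple("BedRecord", "chrom start end name score strand")


def parse_bed_line(line):
    """Parse columns 1-6 of a BED line; columns 4-6 are optional."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    score = int(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else None
    return BedRecord(chrom, start, end, name, score, strand)


# Illustrative line only; the values are not taken from the notes
record = parse_bed_line("chr1\t11873\t14409\tNR_046018\t0\t+\n")
length = record.end - record.start
```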
describe Perl scripts
- comments run from a number sign (#) to the end of the line - most statements end with a semicolon (;)
describe the 5 main applications of MSA
- conserved residues are likely to be part of an active site or functional motif - patterns in the sequences are useful in classifying families or subfamilies within a set of homologues - conservation patterns facilitate the identification of distantly related homologues - molecular phylogeny starts with MSA (the most critical part of making a tree) - protein structure predictions are more reliable if based on an MSA than on a single sequence
describe relational tables
- contain a set of rows (tuples) - data elements in each row represent certain facts that correspond to a real-world entity or relationship - each column (header) is called an attribute
describe PAM substitution matrices
- derived from highly similar sequences of protein families - measures sequence divergence --- 1 PAM = 1% accepted mutation (~99% sequence identity) --- the *PAM250* level (~20% overall identity) is the *lowest sequence similarity* for which a correct alignment may be produced by pairwise sequence comparison alone **the higher the #, the lower the % identity*
Perl double-quoted strings vs. single-quoted strings
- double-quoted strings are interpreted by Perl (variables and escape sequences are expanded) - single-quoted strings: Perl does not interpret the characters within the single quotes **\n indicates the end of a line (inside double quotes)*
what are some new data streams that represent the dynamic states of a cell/organism?
- genomic DNA methylation patterns - RNA content of a cell - RNA splice variants - post-translational modifications of proteins - protein-protein interactions - protein-DNA interactions
what are the 2 main problems associated with sequence alignment?
- how to efficiently examine all possible alignments? - how to score the quality of each alignment?
describe the overall framework for intrinsic/ab initio gene prediction
- initial (5′) exon is preceded by a core promoter with sequence elements such as the TATA box - internal exons do not have stop codons and are instead bound by splice signals such as 5′ GT and 3′ AG - final (3′) exon often contains a stop codon, followed by a polyadenylation signal
how do you handle too few BLAST search results?
- raise the expect value threshold - try scoring matrices with lower BLOSUM numbers or higher PAM numbers - search additional databases - use specialized BLAST programs
what are the advantages of using the standalone BLAST package from NCBI?
- saves time - allows you to create your own target DBs - a must for large-scale sequence analysis
how do you handle too many BLAST search results?
- select a RefSeq database to reduce redundancy - limit the results by an organism or group - adjust the scoring matrix - use just a portion of the query sequence
what are the 4 methods of biological information retrieval
- text/keyword search (NCBI Entrez) - sequence-based search (BLAST) - profile-based search (PSI-BLAST) - structure-based search (VAST)
describe the occasionally dishonest casino problem
- the probability of moving from a fair to a fair die is 0.95, while the probability for moving from a fair to a loaded die is 0.05 (or vice versa) - the probability of rolling a 6 on a fair die is 0.167, while the probability of rolling a 6 on a loaded die is 0.5 - the state sequence/path is hidden (*H*MM)
describe local alignment algorithms
- useful in DB searching - may miss some conserved regions
what is the format of GenBank qualifiers?
/name = "value"
T/F: PSI-BLAST is less powerful than simple pairwise BLASTP for the identification of distant homologues
FALSE - PSI-BLAST is more powerful than BLASTP
T/F: most eukaryotic genes do not have exons and introns
FALSE - the presence of introns and exons makes gene predictions much harder in eukaryotes compared to prokaryotes
what are the 2 types of biological data integration?
DB federation and warehousing
what is the main difference between a common table and a DB table?
DB tables have a key
what is the ENCODE project?
Encyclopedia of DNA Elements - trying to understand the function of the human genome - >80% of the genome is transcribed
what 2 machine learning algorithms are used in the construction of Pfam HMMs?
Hmmbuild, Hmmalign
RepeatMasker searches a DNA sequence query against curated libraries of repeats, including what 2 examples?
Repbase and Dfam
what is Repbase?
a database of prototypic sequences representing repetitive DNA from different eukaryotic species
what is FASTA format? what are the identifiers?
a sequence file format; should start with (>), followed by a unique sequence identifier and a short description *sequence identifiers* - GenBank accession version ex.: >NM_12345.1 Homo sapiens
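A small sketch of reading FASTA text, assuming the identifier is the first whitespace-delimited token after (>). The function name `read_fasta` and the toy sequence lines are made up for illustration.

```python
def read_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: identifier first, then an optional description
            identifier = line[1:].split()[0]
            records[identifier] = []
        elif identifier is not None:
            records[identifier].append(line)
    # Sequence lines may be wrapped, so join them per record
    return {name: "".join(chunks) for name, chunks in records.items()}


example = ">NM_12345.1 Homo sapiens\nATGGCC\nTTAA\n"
sequences = read_fasta(example)
```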
what is a computer network?
a set of computers that are connected and able to exchange data
what is ontology?
a set of terms, relationships, and definitions, which capture the knowledge of a certain domain - terms represent a controlled vocabulary and define the concepts of a domain - terms are linked by relationships, which constitute a semantic network - augment natural language annotations and can be more easily processed computationally
what is a dotplot?
a simple picture that gives an overview of the similarities between two sequences - does not provide a robust statistical measure for the alignment quality - can be used to find regions of local alignment, insertions, or deletions (shifts) - tools include Dotlet, Dotter, and Dottup
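A bare-bones text dotplot, assuming single-residue comparison with no window or threshold (tools such as Dotlet offer both). The function name `dotplot` is illustrative. A run of `*` along a diagonal marks a region of similarity, and a break or shift in the diagonal marks a mismatch or an insertion/deletion.

```python
def dotplot(seq_a, seq_b):
    """One text row per residue of seq_a; '*' where it equals the residue of seq_b."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


rows = dotplot("GATTACA", "GATCACA")
for row in rows:
    print(row)
```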
what is TWINSCAN/N-SCAN?
a system that extends the GENSCAN model by exploiting comparison of related genomes (e.g., human and mouse); uses external data
what is the Entrez system?
a text-based search and retrieval system for the federated databases at NCBI - connects DB entries with neighboring and hard links
what is a table key?
a value of an attribute (or a set of attributes) that uniquely identifies each row in the table - if a relational table has several candidate keys, one is chosen as the *primary key*
what is Perl?
a very high-level language; it is easy, mostly fast, but kind of ugly - optimized for problems that are about 90% working with text and about 10% everything else - especially good for quick-and-dirty solutions - its simplest data type is the *scalar*
match each of the following sample questions with the appropriate BLAST search strategy a) what known protein is a lipocalin EST most related to? b) what other proteins are related to RBP4 protein? c) is there an RBP4 ortholog represented in a genomic DNA DB? d) is the 3' untranslated region of human RBP4 DNA homologous to the 3' untranslated region of RBP paralogues or orthologues?
a) BLASTX b) BLASTP c) TBLASTN d) BLASTN
what PAM/BLOSUM matrices are best for the following? a) short alignments that are very similar b) detecting members of a protein family c) long alignments of divergent sequences
a) PAM40, BLOSUM90 b) PAM160, BLOSUM80 c) PAM250, BLOSUM30 **BLOSUM62 is good for all potential similarities*
describe the speeds of the following: a) BLAST b) HMM c) Threading
a) VERY fast b) fast c) VERY slow
describe the sensitivity levels for the following: a) BLAST b) HMM c) Threading
a) low b) high c) VERY high
define source code
code in its text-based form that is yet to be translated into machine code
what is FGENESH? what are the different variants?
an *extrinsic* HMM-based gene structure prediction (multiple genes, both chains) - FGENESH+: variant using homologous proteins for accurate assembly of predicted exons - FGENESH++: an automatic genome annotation pipeline, which applies FGENESH+ - FGENESH_C: variant that incorporates cDNA/EST sequences - FGENESH-2: variant using sequences of two related genomes (e.g., human and mouse)
what is GENSCAN?
an *intrinsic* HMM-based program using many higher-order properties of genomic sequences such as gene density, exon size distribution, etc.
describe iterative methods
compute a suboptimal solution using a progressive alignment strategy, and then modify the alignment using dynamic programming or other methods until a solution converges - **can be used to overcome the inherent limitation of progressive alignment* - utilized by Clustal Omega, MAFFT (Multiple Alignment using Fast Fourier Transform), and MUSCLE (MUltiple Sequence Comparison by Log-Expectation)
what is a foreign key?
an attribute in one table, which is the primary key of another table - used to establish and enforce a link between the data in two tables
what is GENOMESCAN?
an extension of GENSCAN that incorporates sequence similarity to known proteins; uses external data
define genome
an organism's full set of genes, which determine the primary structures of gene products (polypeptides and RNAs)
what is Dfam?
contains a collection of sequence alignments and hidden Markov models (HMMs) of transposable elements and other repetitive DNA elements
what is InterPro?
contains an integrated and curated collection of protein families, domains and motifs from PROSITE, PRINTS, Pfam, SMART, etc.
name 2 sequence assembly algorithms
assembly is often based on alignment - overlap graph - De Bruijn graph
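A toy sketch of De Bruijn graph construction, assuming error-free reads. Each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name and the single example read are hypothetical, and a real assembler would also handle sequencing errors, reverse complements, and path finding.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


graph = de_bruijn_graph(["ACGTC"], k=3)  # k-mers: ACG, CGT, GTC
```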
what are neighboring relationships?
based on statistical measures of similarity between database entries - related sequences: BLAST is used for pair-wise comparison of sequences - related articles: Entrez uses the weighted key terms method, which takes into account common terms, proximity, and background frequency
what are hypertext links?
between related entries in different databases; simple way of "integrating" biological data from different sources
define executable
binary code; data that is meant to be run on a CPU
the majority of data is ____________ (dynamic/static)
dynamic
the Perl assignment operator is ______
equal sign (=)
describe consistency-based methods
essentially applies the transitive property (if residue x aligns with y and y aligns with z, then x should align with z) - score pairwise alignments in the context of information about multiple sequences - often generate final MSAs that are *more accurate than those achieved by progressive alignment methods* - ProbCons and T-COFFEE
describe archival (primary) DBs
for data archives - contain experimental data - not modified by curators - could be highly redundant - ex: DBs of DNA/protein sequences and structures (GenBank, Protein Data Bank, etc.)
what is error-correction learning?
for each of the given examples, a computer program makes a prediction based on what was already learned (i.e., model parameters), compares the prediction with the given output to calculate the error, and adjusts the model parameters in some way (learning algorithm) to minimize the error
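One concrete instance of this loop is the classic perceptron rule, sketched here on the logical AND function. The choice of a perceptron, the learning rate, and the epoch count are assumptions for illustration, not something the notes specify.

```python
def predict(weights, bias, features):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    return 1 if bias + sum(w * x for w, x in zip(weights, features)) > 0 else 0


def train_perceptron(examples, epochs=10, learning_rate=1.0):
    """Predict, compare with the given output, and adjust parameters by the error."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            # No change when the prediction is right (error == 0)
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_examples)
```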
what is GMOD?
generic model organism DB; project aims to develop a collection of software tools for managing, visualizing, storing, and disseminating genetic and genomic data
what is an HSP?
high-scoring segment pair; consists of two sequence segments whose alignment is locally maximal and meets the score threshold - more than 1 may be found - cumulative score is the sum of the position-by-position scores - T is the neighborhood score threshold, S is the actual threshold needed to return a hit, and X is the significance decay
what are orthologues?
homologous genes from different species that are thought to have descended from a common ancestor
what are paralogues?
homologous genes in the same species
what are the Ensembl identifiers for humans, mice, and fruit flies?
humans: ENS mice: ENSMUS fruit flies: FB
describe structure-based methods
it is possible to improve the accuracy of MSAs by including the 3-D structure information of proteins - tertiary structures evolve more slowly than primary sequences - PRALINE and T-COFFEE Expresso
what is the Smith-Waterman algorithm?
local alignment algorithm; the most rigorous method by which regions of two protein or DNA sequences can be aligned - guaranteed to find the optimal alignment(s) between two sequences, but it is relatively *slow*
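A simplified sketch of the Smith-Waterman recurrence, assuming +1 match, -1 mismatch, and a linear gap penalty of -1 (all three values are assumptions). It returns only the best local score. The key difference from global alignment is that a cell is never allowed to drop below 0, so an alignment can start and end anywhere.

```python
def local_alignment_score(x, y, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of x and y, keeping only two rows in memory."""
    best = 0
    previous = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        current = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            diagonal = previous[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # The 0 option restarts the alignment instead of carrying a negative score
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best
```

Every cell of the x-by-y matrix is still visited, which is the reason the method is relatively slow for database searches.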
Bacterial genes typically correspond to _______________________
long open reading frames (ORF)
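A simplified ORF scan, assuming the forward strand only, ATG as the only start codon, and the standard stop codons TAA, TAG, and TGA. The function name, the `min_codons` parameter, and the toy sequence are illustrative. A real bacterial gene finder would also scan the reverse complement and apply a much longer length cutoff.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) slices of forward-strand ORFs, from ATG through the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


sequence = "CCATGAAATTTTAGGG"
orfs = find_orfs(sequence)
```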
what is MSA?
multiple sequence alignment; a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned - can reveal sequence conservation patterns --- homologous residues may be aligned in columns across the length of the sequences --- amino acid residues of different physiochemical types are often displayed in different colors
describe scalar variables
names consist of a dollar sign ($) and an identifier - identifier begins with a *letter or underscore (_)*, followed possibly by more letters, or digits, or underscores - ex: $Clemson, $_c2, $example
what is NLP?
natural language processing; based on machine learning, it enables computers to derive meaning from natural language input
what is BLAT?
nucleotide or protein sequence-based search
what are scalars?
numbers and simple strings; scalar variables only hold one scalar value
what has been the evolution of biological research?
observation-driven / theory ↓ hypothesis-driven / experiment ↓ data-driven / hypothesis-generating
what are hard links?
provided for existing connections between database entries, e.g., a link between a PubMed entry and its corresponding sequence entries
what is gene ontology (GO)?
provides controlled vocabularies for describing gene products in molecular biology, enabling a common understanding of model organisms and between databases
identifying ___________________________________ in eukaryotic DNA is essential for genome analysis
repetitive elements - repetitive DNA has important roles in chromosome structure, recombination events, and the function of some genes - it is important to locate repetitive elements before DNA sequence comparison and gene prediction
describe interspersed repeats
repetitive sequences that are scattered around the genome; most are a result of transposition
what is SRA?
Sequence Read Archive; stores raw sequence data and alignment information from high-throughput sequencing platforms, and makes the data available to the research community
what does structural genomics study?
sequencing genomes and analyzing nucleotide sequences to identify genes and other important sequences such as gene regulatory elements - approaches: --- clone-by-clone sequencing (map-based) --- whole-genome shotgun sequencing (most widely-used)
what are SNPs?
single nucleotide polymorphism; a single base pair change in either a coding or non-coding part of the genome - can be neutral or tied to disease (most are neutral) --- ex: blood type alleles (neutral) --- ex: sickle-cell anemia (disease); A to T substitution
what is database and software development?
the development and implementation of tools that enable efficient access and management of different types of biological information
what is theoretical bioinformatics?
the development of new algorithms and statistics to assess relationships among members of large data sets
define proteome
the entire set of proteins translated - human genome = ~20,000 protein-coding genes - human proteome = ~100,000 proteins
what is an E value?
the expected number of matches that give the same Z score or better if the database is searched with a random sequence - *best way to estimate false-positives* (the lower the better) - E-value cutoffs for BLAST searches include 1e-6 and ≥70% identity for nucleotide searches, and 1e-3 and ≥25% identity for protein searches
what is sequence alignment?
the identification of residue-residue correspondence
what is PubMed?
the literature component of Entrez at NCBI - provides access to over 30 million citations for biomedical literature from MEDLINE (Medical Literature, Analysis, and Retrieval System Online) and other sources
what is RepeatMasker?
the most widely used tool for characterizing repetitive DNA - can be used to identify SINE, LINE, LTR and DNA transposons as well as several other categories of repetitive DNA (simple repeats, low-complexity DNA, and satellite DNA) - over 56% of human genomic sequence is identified and masked by RepeatMasker
the numbers of a PSSM reflect...
the probability of any amino acid occurring at each position; also reflects the effect of a conservative or non-conservative substitution at each position in the multiple sequence alignment, much like a BLOSUM matrix does
what is a P value?
the probability that the observed match could have happened by chance
what is genome annotation? what are some software programs that do this?
the process of identifying the location and function of genes - GENSCAN, FGENESH, AUGUSTUS, etc.
the probability P of a state path, given the model and an observation (sequence), is equal to...
the product of all the emission and transition probabilities along the path
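A worked sketch of that product, using the occasionally dishonest casino numbers from these notes (0.95 to stay in a state, 0.05 to switch, reading "or vice versa" as symmetric). Two values are assumptions not given in the notes: the loaded die rolls each of 1-5 with probability 0.1, and the path starts in either state with probability 0.5.

```python
TRANSITION = {("F", "F"): 0.95, ("F", "L"): 0.05,
              ("L", "L"): 0.95, ("L", "F"): 0.05}
EMISSION = {
    "F": {face: 1 / 6 for face in range(1, 7)},          # fair die
    "L": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},  # loaded die (1-5 assumed 0.1)
}
START = {"F": 0.5, "L": 0.5}  # assumed start probabilities


def path_probability(states, rolls):
    """Multiply the emission and transition probabilities along one state path."""
    p = START[states[0]] * EMISSION[states[0]][rolls[0]]
    for prev, cur, roll in zip(states, states[1:], rolls[1:]):
        p *= TRANSITION[(prev, cur)] * EMISSION[cur][roll]
    return p


# Path Fair, Fair, Loaded for the rolls 1, 2, 6
p = path_probability("FFL", [1, 2, 6])
```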
what is machine learning?
the study of computer algorithms that automatically improve performance through experience - may also mean that we have a set of examples from which we want to extract some patterns using computers
define genomics; what are 4 main sub-disciplines?
the study of genomes; applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble and analyze genomes - *disciplines* --- structural --- functional --- comparative --- metagenomics
what is the unifying principle?
use the same protocol (TCP/IP) for network interconnection
what is Ensembl BioMart?
used to retrieve Ensembl data in text format - users specify a dataset, filters, and attributes for download
in Ensembl, data are presented as ___________, with each showing a different level of detail
views
DB federation vs. DB warehousing
*DB federation*: integrates different data sources using an application-specific, generalized schema, but leaves data at the remote sources - ex: NCBI Entrez, IBM DiscoveryLink *DB warehousing*: consolidates all specified data into a local database with a generalized schema - ex: UniProt, UCSC Genome Browser, FlyBase
what are the 3 states of HMM protein domain models?
*M*: match *D*: delete *I*: insert
describe 2 software programs for genome assembly
*Phred/Phrap/Consed suite* - Phred: base caller - Phrap: assembler - Consed: assembly viewer and editor *SOAPdenovo* - developed by BGI for de novo assembly of a large genome (human-sized) from short reads (e.g., Illumina reads)
describe the advantages and disadvantages of DB warehousing
*advantages* - clean/curated data - fast queries - data is kept in local DBs (downloadable) *disadvantages* - stale data - complex schema - data consolidation
describe the advantages and disadvantages of DB federation
*advantages* - current data - flexible architecture - no data consolidation (time-consuming) *disadvantages* - slower queries - complex schema - keeps data in their original sources
what is NCBI GenBank?
*archival (primary) DB*; one of the world's largest DNA sequence DBs - contains an annotated collection of all publicly available DNA sequences
the common practice to transfer functional annotation from a previously annotated protein with significant sequence similarity relies on what two assumptions? what does this approach require?
*assumptions:* - proteins with similar sequences have similar functions - the previous annotation is correct *requires:* - efficient sequence alignment algorithms and sequence similarity measures - high-quality databases of known protein sequences
describe the diagram for DB access through a web interface
*client* ⇌ [internet (HTTP)] ⇌ *web server* (NCBI) ⇌ [SQL] ⇌ *DBMS* (GenBank)
what is JIGSAW?
*combination* integrative gene prediction program that combines the outputs from other gene finders, splice site predictors and sequence alignments - provides an automated way to take advantage of the many successful gene prediction methods, and can provide significant improvements in accuracy over an individual method - one of the best-performing programs in the ENCODE Genome Annotation Assessment Project (EGASP) competition
describe data model vs. relational data model
*data model*: an integrated collection of concepts for describing data, relationships between data, and constraints on the data *relational data model*: the conceptual basis of a relational database, which organizes data into tables (relations)
list 3 variants of BLAST other than PSI-BLAST
- *PHI-BLAST* (Pattern Hit Initiated BLAST): inputs a protein sequence and a pattern contained in the sequence, and searches the target DB for other proteins that also contain the input pattern and have significant similarity to the query sequence - *MegaBLAST*: optimized for aligning either long or highly similar nucleotide sequences (>95%); is ~10 times faster than BLASTN, and thus may be used to find if a query sequence is part of a larger contig - *BLAT* (BLAST-Like Alignment Tool): designed to rapidly align long nucleotide sequences with >95% similarity, and is particularly well suited for finding the position of a sequence (e.g., cDNA) in the genome
what are the 3 UniRef clusters?
- *UniRef100*: derived by combining identical sequences and sub-fragments - *UniRef90*: built by clustering sequences with at least 90% identity and 80% overlap - *UniRef50*: built by clustering sequences with at least 50% identity and 80% overlap
describe client-server computing
- *client*: software that requests services - *server*: software that provides services - *services*: sending data or HTML files, running programs on the server computer (host), etc. (see diagram)
describe the 2 types of table join operations
- *cross join*: returns the Cartesian product of rows from tables in the join - *equi-join*: joins two or more tables where the specified columns are equal
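The two joins can be compared on a toy pair of tables using Python's built-in `sqlite3` module. The table names, column names, and rows below are invented for illustration. `annotation.gene_id` acts as a foreign key that points at the primary key of `gene`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE annotation (
    gene_id INTEGER REFERENCES gene(gene_id),
    go_term TEXT
);
INSERT INTO gene VALUES (1, 'geneA'), (2, 'geneB');
INSERT INTO annotation VALUES (1, 'termX'), (2, 'termY'), (2, 'termZ');
""")

# Cross join: every gene row paired with every annotation row (2 x 3 rows)
cross_rows = conn.execute("SELECT * FROM gene CROSS JOIN annotation").fetchall()

# Equi-join: only the pairs whose gene_id columns are equal
equi_rows = conn.execute(
    "SELECT gene.symbol, annotation.go_term "
    "FROM gene JOIN annotation ON gene.gene_id = annotation.gene_id"
).fetchall()
conn.close()
```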
describe the 5 approaches to MSA
- *exact methods*: employing dynamic programming, not feasible in time and space for many sequences - *progressive alignment*: ClustalW has been the most popular program from the 1990s until recently - *iterative methods*: Clustal Omega, MAFFT, and MUSCLE are fast and highly accurate - *consistency-based methods*: ProbCons and T-COFFEE are slower but more accurate - *structure-based methods*: using protein structural information to improve the alignment accuracy
describe the 3 types of eukaryotic single-copy noncoding regions of DNA
- *gene regulatory sequences*: promoters and enhancers - *pseudogenes*: dysfunctional gene copies with significant mutations, usually not transcribed - *noncoding RNA genes*: rRNAs, tRNAs, microRNAs, long noncoding RNAs (lncRNAs), etc.
what programs are included in HMMER?
- *hmmbuild*: for building a profile HMM from a MSA - *hmmsearch*: for searching a sequence database using profile HMM(s) - *hmmscan*: for searching a profile HMM database using protein sequence(s) - *hmmalign*: used to align sequences to an HMM - *hmmlogo*: given an HMM, producing data required to build an HMM logo
what are the 3 methods for finding protein-coding genes in eukaryotes?
- *intrinsic/ab initio*: search for exons and introns based on signals and patterns in the genomic DNA - *extrinsic*: utilize external data such as transcriptional data and protein sequences --- homology-based gene prediction compares genomic sequences of interest against known coding regions --- comparative gene prediction compares genomic sequences of interest with other available genomes --- gene finding and annotation using RNA-sequence data - *combiners*: intrinsic and extrinsic methods are used in combination
what is included in the standalone BLAST package from NCBI?
- *makeblastdb*: used to format protein or nucleotide sequence DBs for BLAST searches - *blastp*: for BLASTP searches in batch mode - *blastn*: for BLASTN searches in batch mode - *blastx*: for BLASTX searches in batch mode - *tblastn*: for TBLASTN searches in batch mode - *tblastx*: for TBLASTX searches in batch mode - *psiblast*: used for PSI-BLAST searches - *rpsblast*: used to search a protein sequence query against the conserved domain DB (CDD)
what are the 3 unlinked hierarchies of GO?
- *molecular function*: elemental activity/task (e.g., DNA-binding, polymerase, transcription factor) - *biological process*: goal or objective (e.g., mitosis, DNA replication, cell cycle control) - *cellular component*: location or complex (e.g., nucleus, ribosome, pre-replication complex)
what are the 3 main tasks of protein structure prediction?
- *secondary structure prediction* (prediction of alpha helices and beta sheets) - *homology modelling* (predict 3D structure based on known structures of homologous proteins) - *fold recognition* (compare structure to known structures in a library based on the folding pattern)
describe the 2 types of machine learning
- *supervised*: learning with a teacher (by using a set of input-output training examples); training and testing - *unsupervised*: to let the computer explore the data space and find some interesting patterns; clustering
what are the 4 relationships for data entity association?
- 1 : 1 (university : president) - 1 : many (university : department) - many : 1 (student : course) - many : many (gene : GO terms)
what are the 5 sources of information retrieval from NCBI Entrez?
- Entrez Gene - OMIM - GEO - SRA - dbSNP
which 3 genome browsers were developed to facilitate human genome annotation?
- NCBI Map View / Genome Data Viewer - UCSC Genome Browser (user-friendly, most widely used) - Ensembl (less user-friendly, but has downloadable data) **each genome browser provides a Web interface to query and browse the data*
what are the 4 steps of the DB development process?
1. *requirement analysis*: system requirements, database requirements, and system design 2. *data modeling*: conceptual data model, logical data model, and physical data model 3. *DB construction*: schema implementation, data population, and query optimization 4. *application development*: database interfaces, data mining tools, and system testing **the DB approach separates data modeling from application development (think of the data first and the application second)*
describe the steps of construction of the Pfam HMMs
1. PROSITE, literature 2. family definition 3. seed alignment 4. HMM profile 5. full alignment **process returns to step 2 from step 4 if the HMM doesn't find all the members
what is the best solution to the following biological problem? Suppose there is a novel gene identified in Drosophila and C. elegans, but not yet in the human genome. This gene is involved in an interesting biological process, and you are interested in finding the orthologous gene in the human genome. However, BLAST search using each of the known sequences failed to identify the human homologue. What else can you try?
1. collect all known sequences in literature 2. do MSA and editing 3. create a profile HMM using hmmbuild 4. search a protein sequence DB using the HMM and hmmsearch
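Steps 3 and 4 can be sketched as two HMMER command lines, assuming HMMER 3 is installed. The file names are hypothetical, and the MSA is assumed to have been prepared by hand in steps 1 and 2.

```python
import shutil
import subprocess


def hmmer_commands(msa_file, hmm_file, seq_db, hits_file):
    """Return the hmmbuild and hmmsearch command lines for a profile HMM search."""
    build = ["hmmbuild", hmm_file, msa_file]            # MSA -> profile HMM
    search = ["hmmsearch", "--tblout", hits_file,       # per-target hit table
              hmm_file, seq_db]                         # profile HMM vs. protein DB
    return [build, search]


def run_if_available(commands):
    """Run the commands in order, but only if the first program is on PATH."""
    if shutil.which(commands[0][0]) is None:
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True


# Hypothetical file names for illustration only.
commands = hmmer_commands("family.sto", "family.hmm", "human_proteins.fasta", "hits.tbl")
```

Calling `run_if_available(commands)` would run both steps when the input files exist; hits reported in the table are candidate human homologues to inspect further.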
describe the steps of BLAST algorithm dotplots
1. empty dotplot 2. search DB with query words (no gaps) 3. extend each match in both directions 4. perform local alignments with mismatches and gaps
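Steps 2 and 3 can be sketched in a few lines, as a toy version under simplifying assumptions: only exact word matches are used as seeds, and extension stops at the first mismatch. Real BLAST also scores similar "neighborhood" words for proteins and uses a score drop-off rule for extension. The function names are made up for this example.

```python
def word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a w-letter word matches exactly."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))  # one dot on the dotplot
    return hits


def extend_ungapped(query, subject, i, j, w):
    """Extend a word hit in both directions while the letters keep matching."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + w, j + w
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return query[start_q:end_q]


hits = word_hits("ACGTACGT", "TTACGTAA", w=4)
segment = extend_ungapped("ACGTACGT", "TTACGTAA", 0, 2, 4)
```

Step 4, the gapped local alignment, would then be run only around the extended segments rather than over the whole database.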
list the steps of UniProt data warehousing approach
1. external sources 2. UniParc (cleans up collected data) 3. UniRef OR UniProtKB 4. (in UniProtKB) TrEMBL, Swiss-Prot - TrEMBL is automatically annotated; Swiss-Prot is manually annotated 5. proteomes or UniRef
what are the 6 steps of supervised machine learning?
1. question 2. data 3. features 4. algorithms 5. parameters 6. evaluation
what are the 3 main stages of progressive sequence alignment?
1. series of pairwise alignments 2. create a guide tree 3. MSA created based on the order of the guide tree
describe BLAST search strategies (5 programs)
1. starting point is a molecular sequence 2. BLASTP, BLASTN, BLASTX, TBLASTN (TBLASTX) - **choose which program to use based on what your sequence is and what you want to do with it*
Smith-Waterman algorithm vs. BLAST
BLAST is a rapid, heuristic version of the Smith-Waterman algorithm (not guaranteed to find the optimal alignments) - BLAST takes a small integer w ("word size"), and determines all instances of the words in a query sequence that occur in any sequence in the database
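For comparison, here is a minimal sketch of the Smith-Waterman recurrence, which fills the full dynamic-programming matrix and therefore does find the optimal local alignment score. The scoring values (match = 2, mismatch = -1, gap = -2) and the linear gap penalty are assumptions chosen for illustration, not values from the course.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the matrix is computed, so the running time grows with the product of the two sequence lengths; BLAST avoids this by examining only the regions around word hits.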
what is BLAST?
Basic Local Alignment Search Tool - by far the most widely used method for sequence similarity analysis - sequence similarity searches can identify homologous genes that are evolutionarily related in other organisms - an important tool for genome annotation and comparative genomics
what are affine gap penalties?
gap penalty = G + Ln, where G = gap opening penalty, L = gap extension penalty, and n = length of the gap
what are HMMs?
Hidden Markov Models - class of probabilistic models that are generally applicable to time series or linear sequences - used in speech recognition, as well as gene prediction and protein domain modeling
describe overall protein domain analysis using HMM
MSA → (HMMER) → HMM ⇌ (search) ⇌ your sequence set - HMMER is used to construct HMMs
what is OMIM?
Online Mendelian Inheritance in Man - electronic version of the catalog of human genes and genetic disorders
what is PDB?
Protein Data Bank; *mostly archival (primary), partially derived (secondary) DB* - the primary repository for protein structures
T/F: in the higher eukaryotes, genome size is not necessarily a measure of the complexity of the organism
TRUE
T/F: the vast majority of a eukaryotic genome does not encode protein
TRUE - protein-coding sequences often constitute only a small portion of a eukaryotic genome (1-10%)
how do we measure success of genome assembly?
a *larger N50 value* indicates greater success; the extent to which the *assembly spans ESTs and cDNAs* is a measure of completeness - N50: the contig length such that contigs of that length or longer contain at least half the bases in a given assembly - as the assembly becomes more complete, the absolute number of contigs and scaffolds decreases
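A short sketch of the N50 calculation, using the usual convention: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembly. The contig lengths in the example are made up.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum reaches half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Made-up assembly: 290 bases in total, so half is 145; 80 + 70 = 150 reaches it.
example = n50([80, 70, 50, 40, 30, 20])
```

Merging contigs into fewer, longer ones raises N50, which is why a larger value is read as a better assembly.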
what is UniProt? what are its 3 participants?
a comprehensive resource for protein sequence and functional information; its participants are: - Swiss-Prot: high-quality protein records with *manual* annotation based on literature and computational results; grows slowly - TrEMBL: translated coding sequences (CDS) in EMBL; *automatically* annotated (not reviewed); grows rapidly - PIR (Protein Information Resource): probably the first sequence database, started in the early 1960s at Georgetown University; produced the Protein Sequence Database (PIR-PSD)
what is FlyBase?
a database of Drosophila genes and genomes - access through text search, BLAST, and JBrowse
what is Pfam?
a database of common protein domains (multiple sequence alignments and HMMs) - one of the most trusted and widely used resources for protein families
describe tandem repeats
aka satellite DNA; DNA sequences with multiple copies arranged next to each other - play structural roles in centromeres and telomeres - microsatellites (1-4 bp) and minisatellites (10-100 bp) are two relatively small repetitive sequence types that are used as markers in genetic disease diagnosis, kinship and population studies, and forensic investigations
define transcriptome
all the RNA transcripts synthesized by an organism
what is DBMS?
database management system; used to define, create, query, update, and administer databases - a tool to help developers make DBs
what are the 3 main learning algorithms?
decision tree, artificial neural network (ANN), and support vector machine (SVM)
what is the PSI-BLAST threshold?
default 0.005; it is the E value cutoff for inclusion of hits for the next round - all the hits with E values better than the inclusion threshold are used to construct a PSSM, unless the user edits the inclusion list manually
GO annotation is based on...
evidence from biological experiments and/or computational analyses
protein domains represent...
evolutionarily conserved amino acid sequences carrying functional and structural information - HMMs are used for protein domain modeling
what do protein domains represent?
evolutionarily conserved amino acid sequences carrying functional and structural information - domain analysis helps understand the biological function of a gene - HMMs are used for protein domain modeling
what does the PAX-6 gene do?
first cloned in mouse and human, it is a master regulatory gene controlling a complex cascade of events in eye development - mutations cause *aniridia*, a developmental defect in which the iris of the eye is absent or deformed - flies with a mutated PAX-6 gene develop without eyes or develop ectopic eyes (Drosophila)
describe derived (secondary) DBs
for curated reviews - contain information derived from the archival databases, and inferred from analysis of the contents - often annotated by experts and curators - may contain computationally derived results - ex: DBs of sequence motifs and scientific publications (Pfam, PubMed, etc.)
what are the 2 types of biological networks?
physical and logical - physical data is often represented in a burst graph, while logical data is presented in a table (genes, protein, metabolites)
what is PSI-BLAST? what is the method associated with it?
position-specific iterated BLAST; particularly well suited for identifying distantly related proteins, which may not be found using BLASTP *method* 1. take a query protein sequence and perform a standard BLASTP search 2. the hits, along with the query sequence, are used to construct a Position-Specific Scoring Matrix (PSSM) in an automated fashion 3. the PSSM then serves as the query for re-searching the target database 4. the above two steps are repeated until the search either converges or the specified number of iterations is reached
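Step 2 can be illustrated with a simplified PSSM builder. This is a sketch under strong assumptions: the hits are already aligned without gaps, every sequence gets equal weight, and a flat pseudocount is used. PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts. The four-letter alphabet is a toy stand-in for the 20 amino acids.

```python
import math
from collections import Counter


def build_pssm(aligned, background, pseudocount=1.0):
    """Column-wise log-odds scores from equal-length, ungapped aligned sequences."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            letter: math.log2(((counts[letter] + pseudocount) / total) / background[letter])
            for letter in alphabet
        })
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of seq against the matrix."""
    return sum(column[letter] for column, letter in zip(pssm, seq))


# Toy alphabet and toy hits, for illustration only.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = build_pssm(["ACGT", "ACGA", "ACGT"], background)
```

Because each column has its own scores, a conserved position penalizes substitutions more heavily than a variable one, which is what lets the next search round pick up distant relatives.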
what is RefSeq?
reference sequence DBs that aim to provide a high-quality (curated by experts), comprehensive, non-redundant set of sequences - used for the functional annotation of many genome sequencing projects, including those of human and mouse - accessible via BLAST, Entrez, and FTP
what is RDBMS?
relational database management system; based on the relational model as invented by E.F. Codd of IBM Research in 1970 - Oracle, Microsoft SQL Server, IBM DB2, MySQL, PostgreSQL, etc. - used by most biological DBs
what is SQL? what are the 3 main clauses of a query?
structured query language; *query clauses*: - SELECT (attribute(s)) - FROM (table(s)) - WHERE (condition(s))
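A small runnable example of the three parts working together, using Python's built-in `sqlite3` module. The `gene` table and its rows are made up for illustration and are not taken from any real database schema.

```python
import sqlite3

# In-memory database with a hypothetical table of genes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (symbol TEXT, organism TEXT, chromosome TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [
        ("PAX6", "Homo sapiens", "11"),
        ("Pax6", "Mus musculus", "2"),
        ("ey", "Drosophila melanogaster", "4"),
    ],
)

# SELECT picks the attributes, FROM names the table, WHERE filters the rows.
rows = conn.execute(
    "SELECT symbol, chromosome FROM gene WHERE organism = ?",
    ("Homo sapiens",),
).fetchall()
print(rows)  # prints [('PAX6', '11')]
conn.close()
```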
define metabolome
sum total of all the low-molecular-weight metabolites produced by an organism, cell or tissue
what is coverage in terms of sequence reads?
the average number of times each base appears in the reads - if G = genome size, N = number of reads, and L = length of a read, then *Coverage = NL/G*
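The formula can be checked with a short calculation. The read count, read length, and genome size below are made-up numbers, and `reads_needed` is a hypothetical helper that rearranges the same formula to solve for N.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average sequencing depth: Coverage = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Rearranged formula: N = Coverage * G / L, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_size / read_length)


# Made-up example: 1,000,000 reads of 150 bp over a 5 Mb genome.
depth = coverage(1_000_000, 150, 5_000_000)
```

This is an average; individual bases can be covered more or fewer times than the value suggests.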
what are gap opening penalties (G)?
the cost of creating a gap; example, G = -11
what are gap extension penalties (L)?
the cost of extending the gap by one position; example, L = -1
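Putting the two example values into the affine formula G + Ln gives a quick worked check. The function name is made up, and the sign convention (penalties as negative scores) follows the examples on these cards.

```python
def affine_gap_penalty(n, gap_open=-11, gap_extend=-1):
    """Score contribution of one gap of length n under the G + L*n model."""
    if n < 1:
        raise ValueError("a gap must span at least one position")
    return gap_open + gap_extend * n


# One gap of length 3 costs -11 + (-1 * 3) = -14, while three separate
# gaps of length 1 cost 3 * (-11 + -1) = -36.
one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
```

The difference shows the purpose of the affine model: opening a gap is expensive and extending it is cheap, so alignments favor a few long gaps over many short ones.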
what are the 2 main entry points in the UCSC Genome Browser?
text-based query and BLAT
what is computational biology?
the analysis and interpretation of various types of biological data, including nucleotide and amino acid sequences, protein domains, protein structures, etc.
describe progressive sequence alignment; what is its major limitation?
this strategy entails calculating pairwise alignment scores between all the protein sequences being aligned, and then beginning the alignment with the two closest sequences and progressively adding more sequences to the alignment - permits the rapid alignment of hundreds/thousands of sequences - *limitation*: the final alignment depends on the order in which sequences are joined, and it is therefore not guaranteed to provide the most accurate alignment; errors cannot be corrected
what is the most basic task of computational analysis of biomedical articles?
to identify the names of genes, proteins, metabolites, drugs, and diseases - next level is to identify interactions and associations --- protein-protein interactions --- associations of genes / proteins with diseases
genomic annotations are presented in the UCSC Genome Browser as ___________
tracks - each provides a different type of feature, from genes to CpG islands and to SNPs - many of the tracks represent results from active research programs (experimental or computational) - can be shown or hidden in the presentation
what are retrotransposons? what are the 2 main types?
transposable elements generated via RNA intermediates - LINEs (long interspersed nuclear elements, > 5 kb) - SINEs (short interspersed nuclear elements, < 500 bp) - **LINEs and SINEs are dispersed throughout the genome and constitute over 1/3 of the human genome
about how many protein-encoding genes are found in the human genome?
~23,000 - accounts for <3% of the genome