Bioinformatics Midterm
An example of an expression that has the Boolean value of TRUE
! ""
If the array, @students= ("Andrew", "Jennifer", "Patrick"), what is the value of $students[0]?
"Andrew"
An example of a valid scalar variable in Perl
$Variable_10
Perl's favorite default variable
$_
hash table prefix
%
Functional signals
5′ splice site (donor), 3′ splice site (acceptor), translational start site and stop codon, etc.
Over ___% of the human genome is transcribed
75
The human genome has ___ million regulatory sites, spanning twice as much DNA as protein-coding genes
84
Repetitive sequences comprise ____% of the genome
>50
UniGene
A secondary database that provides a non-redundant set of gene transcripts for an organism
Hidden Markov Models (HMMs)
-A class of probabilistic models that are generally applicable to time series or linear sequences
Archival (Primary) Databases
-Contain experimental data -Not modified by curators -Could be highly redundant -For example, databases of DNA/protein sequences and structures (GenBank, Protein Data Bank, etc.)
Derived (Secondary) Databases
-Contain information derived from the archival databases, and inferred from analysis of the contents -Often annotated by experts and curators -May contain computationally derived results -For example, databases of sequence motifs and scientific publications (Pfam, PubMed, etc.)
Sequence Identifiers of FASTA format
-GenBank accession.version: e.g., NM_139058.1 -GI (GeneInfo) numbers: used mainly by NCBI -Other database accession numbers
Boolean true/false rules
-If the value is a number, 0 means false; all other numbers mean true -If the value is a string, the empty string ('') and the string '0' mean false; all other strings mean true
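The rules above can be checked directly. This is a minimal sketch; the helper name truth is hypothetical and only wraps Perl's ordinary Boolean test.

```perl
use strict;
use warnings;

# Hypothetical helper: report how Perl treats a value in Boolean context.
sub truth {
    my ($value) = @_;
    return $value ? "true" : "false";
}

my @results = (
    truth(0),      # the number zero
    truth(42),     # any other number
    truth(''),     # the empty string
    truth('0'),    # the string '0'
    truth('00'),   # a non-empty string other than '0'
    truth(!''),    # negating the empty string
);

print join(", ", @results), "\n";   # false, true, false, false, true, true
```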
Perl's 6 special filehandle names
-STDIN -STDOUT -STDERR -DATA -ARGV -ARGVOUT
Framework for Ab Initio Gene Prediction
-The initial (5′) exon is preceded by a core promoter with sequence elements such as the TATA box -Internal exons are free of in-frame stop codons, and are delimited by splice signals such as 5′ GT and 3′ AG -The final (3′) exon often contains a stop codon, followed by a polyadenylation signal
array
-a variable that contains a list of data elements -Each element holds a scalar value -Elements are numbered using sequential integers starting at zero and increasing by one for each element -Use the at sign (@) before the name to refer to the entire list
Perl
-a very high-level language. It is easy, mostly fast, but kind of ugly -optimized for problems that are about 90% working with text and about 10% everything else -especially good for quick-and-dirty solutions
JIGSAW
-an integrative gene prediction program that combines the outputs from other gene finders, splice site predictors and sequence alignments -provides an automated way to take advantage of the many successful gene prediction methods, and can provide significant improvements in accuracy over an individual method -one of the best-performing programs in the ENCODE Genome Annotation Assessment Project (EGASP) competition
Tandem repeats (satellite DNA)
-are DNA sequences with multiple copies arranged next to each other -Certain tandem repeats play structural roles in centromeres and telomeres
AUGUSTUS
-based on a generalized HMM with a new method for modeling intron length distributions -It can be used for ab initio gene prediction, and has a flexible mechanism for incorporating extrinsic information, such as EST and protein alignments -one of the most accurate programs for ab initio gene prediction
BLAST
-by far the most widely used method for sequence similarity analysis -Sequence similarity searches can identify homologous genes that are evolutionarily related in other organisms -an important tool for genome annotation and comparative genomics
Why use a hash table?
-can be more condensed and easier to understand than arrays -Hashes versus arrays: %student_grades = ("Andrew", 92, "Jennifer", 98, "Patrick", 86); VS @students = ("Andrew", "Jennifer", "Patrick"); @grades = (92, 98, 86);
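A short sketch of the difference, using the same names and grades as the card. The lookup of "Jennifer" and the variables $from_hash and $from_arrays are illustrative only.

```perl
use strict;
use warnings;

# One hash keeps each name next to its grade, so lookup is a single step.
my %student_grades = ("Andrew", 92, "Jennifer", 98, "Patrick", 86);
my $from_hash = $student_grades{"Jennifer"};

# With parallel arrays, the matching position has to be found first.
my @students = ("Andrew", "Jennifer", "Patrick");
my @grades   = (92, 98, 86);
my $from_arrays;
for my $i (0 .. $#students) {
    $from_arrays = $grades[$i] if $students[$i] eq "Jennifer";
}

print "hash: $from_hash, arrays: $from_arrays\n";   # hash: 98, arrays: 98
```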
<STDIN>
-can be used to read from the standard input stream -each time it is used, Perl reads the next text line from the standard input -The line-input operator returns undef if the end of the input is reached
Prokaryotic Genomes
-considerably less DNA and fewer genes -coding capacity=compact and continuous genes -genes organized into operons -often contain plasmids, which are usually small and circular DNA with additional genes
Scalar Variable Names
-consist of a dollar sign ($) and an identifier -Example: $name ; $Name ; $_n2 ; $n_2
RepeatMasker
-has been the most widely used tool for characterizing repetitive DNA -can be used to identify SINE, LINE, LTR and DNA transposons as well as several other categories of repetitive DNA (simple repeats, low-complexity DNA, and satellite DNA) -searches a DNA sequence query against curated libraries of repeats
Eukaryotic Genomes
-substantially larger genome size with more complex organization -enormous protein-coding capacity, but the majority of DNA does not code for proteins -protein-coding sequences (exons) can be interrupted by noncoding introns, which are removed by splicing from the primary RNA transcript -Alternative splicing allows for various combinations of exons to be joined to form different mRNAs, which produce more than one polypeptide from a gene
Bioinformatics
The field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned
the if control structure
The statements within the curly braces are executed only if the condition returns a true value
T/F: Hidden Markov Models (HMMs) are a class of probabilistic models that have been widely used in gene prediction programs, including GENSCAN, FGENESH, and AUGUSTUS.
True
T/F: In a Perl program, a filehandle is the name of an I/O connection, not necessarily the name of a file.
True
T/F: In a relational database, a foreign key, which is an attribute in one table and the primary key of another table, is used to establish a link between the data in the two tables
True
FGENESH++
An automatic genome annotation pipeline, which applies FGENESH+
FGENESH_C
FGENESH variant that incorporates cDNA/EST sequences
FGENESH+
FGENESH variant using homologous proteins for accurate assembly of predicted exons
FGENESH-2
FGENESH variant using sequences of two related genomes (e.g., human and mouse)
T/F: About 10% of the human genome consists of repetitive elements of various kinds, which can be identified using the software program RepeatMasker.
False
T/F: In a FASTA file with multiple sequences, each sequence record should have a definition line that always starts with the number sign (#).
False
T/F: In the Structured Query Language (SQL), a standard SELECT statement can include a WHERE clause to indicate the table(s) to retrieve data and an IF clause to specify the condition for restricting the rows returned by the query
False
SOAPdenovo
Developed by BGI for de novo assembly of a large genome (human-sized) from short reads (e.g., Illumina reads)
What type of strings does Perl interpolate (variables and backslash escapes)?
Double-quoted strings (ex. $fred = "world!\n")
Key of a Relation
Each row has a value of an attribute (or a set of attributes) that uniquely identifies the row in the table
Primary Key
If a relation has several candidate keys, one is chosen as the primary key
Database Development Process
Requirement Analysis, Data Modeling, Database Construction, Application Development
String Concatenation in Perl
The . operator can be used to concatenate, or join, string values
3 genome browsers developed to facilitate human genome annotation
NCBI Map Viewer, UCSC Genome Browser, and Ensembl
Molecular Function, Biological Process, Cellular Component
The Gene Ontology (GO) provides controlled vocabularies for gene functional annotation, and consists of unlinked hierarchies, including ________.
Computational Biology
The analysis and interpretation of various types of biological data, including nucleotide and amino acid sequences, protein domains, protein structures, etc.
PubMed
The best database to search and retrieve biomedical literature information
Functional Annotation by Sequence Similarities
The common practice to transfer functional annotation from a previously annotated protein with significant sequence similarity relies on two assumptions: 1. Proteins with similar sequences have similar functions 2. The previous annotation is correct
TrEMBL
The data resources provided by UniProt include ________.
Database and software development
The development and implementation of tools that enable efficient access and management of different types of biological information
Theoretical Bioinformatics
The development of new algorithms and statistics with which to assess relationships among members of large data sets
Repbase
a database of prototypic sequences representing repetitive DNA from different eukaryotic species
PAX-6
a master regulatory gene controlling eye development
Protein Folding
a nascent polypeptide acquires a highly organized structure
What do most Perl statements end with?
a semicolon (;)
an ontology
a set of terms, relationships and definitions, which capture the knowledge of a certain domain
TWINSCAN/N-SCAN
a system that extends the GENSCAN model by exploiting comparison of related genomes (e.g., human and mouse).
In a relational database, a relation is ______.
a table of values
Entrez
a text-based search and retrieval system for the federated databases at NCBI; connects DB entries with neighboring and hard links
GENSCAN
an HMM-based program using many higher-order properties of genomic sequences such as gene density, exon size distribution, etc.
In Perl, what do comments run from?
comments run from a pound sign (#) to the end of the line
Transcriptome
comprises all the RNA transcripts synthesized by an organism
Foreign Key
an attribute in one table, which is the primary key of another table
GENOMESCAN
an extension of GENSCAN that incorporates sequence similarity to known proteins.
Data Model
an integrated collection of concepts for describing data, relationships between data, and constraints on the data
Dfam
contains a collection of sequence alignments and hidden Markov models (HMMs) of transposable elements and other repetitive DNA elements
NCBI GenBank
contains an annotated collection of all publicly available DNA sequences
Interspersed Repeats
are repetitive sequences that are scattered around the genome
Perl Scalar Assignment
assignment operator is the equal sign (=)
Scalar Variable identifier
begins with a letter or underscore (_), followed possibly by more letters, or digits, or underscores
each() function
can be used to iterate over an entire hash table and return a key-value pair as a two-element list
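A minimal sketch of each() in a while loop. The grades are the ones used elsewhere in this set; the variables $total and @seen are illustrative. Hash order is not guaranteed, so the pairs are sorted before printing.

```perl
use strict;
use warnings;

my %student_grades = (Andrew => 92, Jennifer => 98, Patrick => 86);

my $total = 0;
my @seen;
# each() returns the next (key, value) pair, or an empty list when done.
while (my ($name, $grade) = each %student_grades) {
    push @seen, "$name=$grade";
    $total += $grade;
}

print join(" ", sort @seen), "\n";
print "total: $total\n";   # total: 276
```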
DATA filehandle
can be used to read the lines of input data appearing after __END__
Subroutine return values
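the return operator can be used to return a value from a subroutine immediately; without it, the value of the last expression evaluated is returned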
the until control structure
can be used to reverse the condition of a while loop
the defined() function
can be used to tell if a variable has the undef value (different from the empty string '')
the else keyword
can provide an alternative choice
Genome
contains the full set of genes, which determine the primary structures of gene products (polypeptides and RNAs)
Transcription
conversion of genetic information from DNA to RNA
OMIM
founded by Victor McKusick at the Johns Hopkins University, is the electronic version of the catalog of human genes and genetic disorders
unary not operator (!)
get the opposite of any Boolean value
values() function
yields a list of all the values in a hash table, in the same order as the keys returned by keys()
Protein Data Bank (PDB)
has been the primary repository for protein structures
Scalar Variable
holds exactly one scalar value
Orthologues
homologous genes from different species that are thought to have descended from a common ancestor
Paralogues
homologous genes in the same species
Position-Specific Iterated BLAST (PSI-BLAST)
improves database search sensitivity (detection of true positives) and selectivity (rejection of false positives)
Genetic Code
information carried in the DNA specifies the protein end product
filehandle
is the name of an I/O connection (not the name of a file) for a Perl program; often named in all uppercase letters
Genome Annotation
is the process of identifying genes, their regulatory sequences, and their functions
The probability P of a state path
is the product of all the emission and transition probabilities along the path
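A worked sketch of this product for a toy two-state model. Every state name, symbol and probability below is made up for illustration, and the start probability is treated as the transition out of a begin state.

```perl
use strict;
use warnings;

# Toy model: state E (exon-like) and state I (intron-like). All numbers are hypothetical.
my %start = (E => 0.5, I => 0.5);
my %trans = (
    E => { E => 0.9, I => 0.1 },
    I => { E => 0.2, I => 0.8 },
);
my %emit = (
    E => { A => 0.25, C => 0.25, G => 0.25, T => 0.25 },
    I => { A => 0.4,  C => 0.1,  G => 0.1,  T => 0.4  },
);

# Multiply the start, emission and transition probabilities along one path.
sub path_probability {
    my ($symbols, $states) = @_;   # two array references of equal length
    my $p = $start{ $states->[0] } * $emit{ $states->[0] }{ $symbols->[0] };
    for my $i (1 .. $#{$symbols}) {
        my ($prev, $cur) = ($states->[$i - 1], $states->[$i]);
        $p *= $trans{$prev}{$cur} * $emit{$cur}{ $symbols->[$i] };
    }
    return $p;
}

# Sequence G T A emitted along the path E -> I -> I:
# 0.5 * 0.25 * (0.1 * 0.4) * (0.8 * 0.4)
my $p = path_probability([qw(G T A)], [qw(E I I)]);
print "P = $p\n";
```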
What type of relationship(s) between genes and GO terms?
many-to-many relationship
the while loop
repeats a block of code as long as the condition is true
Definition Line of FASTA format
should start with a greater than character (>), which is usually followed by a unique sequence identifier and a short description
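A minimal sketch of splitting a definition line into its identifier and description. The accession is the example used in this set; the description text and the regular expression are assumptions, not a full FASTA parser.

```perl
use strict;
use warnings;

# Hypothetical definition line built around the example accession.
my $defline = ">NM_139058.1 example description text";

my ($id, $description);
# '>' first, then the identifier up to the first whitespace, then the rest.
if ($defline =~ /^>(\S+)\s*(.*)$/) {
    ($id, $description) = ($1, $2);
}

print "id: $id\n";
print "description: $description\n";
```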
DNA Sequence Alignment
similar sequences of bases are lined up for comparison
Proteome
the entire set of proteins translated
Translation
the information from an mRNA is translated to the amino acid sequence of the protein
The Human Genome
the majority (>80%) of the human genome is transcribed
N50 value
the length of the shortest contig in the set of longest contigs that together contain at least half the bases in a given assembly
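One common way to compute it, sketched under the assumption that the assembly is given as a plain list of contig lengths. The subroutine name n50 and the lengths are hypothetical.

```perl
use strict;
use warnings;

sub n50 {
    my @lengths = sort { $b <=> $a } @_;   # longest contig first
    my $total = 0;
    $total += $_ for @lengths;

    my $running = 0;
    for my $len (@lengths) {
        $running += $len;
        # Stop at the first contig that brings the running sum to half the total.
        return $len if $running * 2 >= $total;
    }
    return;   # empty input
}

# Total is 30; the two longest contigs (10 + 6 = 16) already cover half.
my $n50 = n50(2, 3, 4, 5, 6, 10);
print "N50 = $n50\n";   # N50 = 6
```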
Whole-genome shotgun sequencing
the most widely used strategy for sequencing and assembling an entire genome
Genomics
the study of genomes; it applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble and analyze genomes
Global variables
they are accessible from every part of the program
To impose some discipline in Perl programming, put the _______ pragma at the top of your program
use strict
To define a subroutine
use the keyword sub, the subroutine name (another Perl identifier), and the block of code in curly braces
To invoke a subroutine
use the subroutine name with the ampersand (&) in front
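A small sketch of defining and invoking a subroutine. The name gc_content and its inputs are hypothetical. The ampersand form is shown as on the card; a plain call also works once the subroutine has been defined.

```perl
use strict;
use warnings;

# Hypothetical subroutine: fraction of G and C bases in a DNA string.
sub gc_content {
    my ($dna) = @_;
    return 0 if length($dna) == 0;
    my $gc = ($dna =~ tr/GCgc//);   # tr/// with no replacement only counts
    return $gc / length($dna);
}

my $with_ampersand = &gc_content("ATGC");
my $without        = gc_content("GGCC");
print "$with_ampersand $without\n";   # 0.5 1
```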
DBMS
used to define, create, query, update, and administer databases
Subroutines
user-defined functions that let you reuse one chunk of code many times in a program
The Critical Assessment of Structure Prediction (CASP) program
uses blind tests to judge the methods of protein structure prediction
Extrinsic methods
utilize external data such as ESTs and protein sequences
LINEs
(long interspersed nuclear elements, > 5 kb)
SINEs
(short interspersed nuclear elements, < 500 bp)
Protein-coding sequences often constitute only a small portion of a eukaryotic genome (_____%)
1-10
Ways to Search Biological Databases
1. Given a sequence, or a sequence fragment, find sequences in the database that are similar to it 2. Given a protein structure, find protein structures in the database that are similar to it 3. Given a sequence of a protein with unknown structure, find structures in the database that adopt similar three-dimensional structures 4. Given a protein structure, find sequences in the database that correspond to similar structures
3 unlinked hierarchies of the Gene Ontology (GO)
1. Molecular Function 2. Biological Process 3. Cellular Component
Static Webpage
1. Request a preexisting HTML file 2. Return the contents of the HTML file
Dynamic Webpage
1. Request service with parameters 2. Run a program using the parameters 3. Return program output (HTML) 4. Return the newly created HTML
The human genome encodes about ________ protein-coding genes, accounting for ~1.5% of the genome
23,000
Gene Prediction Software Programs
GENSCAN, FGENESH, AUGUSTUS, etc.
Central Dogma of Molecular Biology
Describes the general flow of genetic information in all organisms; the exceptions include some viruses with RNA genomes that use reverse transcriptase to make DNA
UCSC Genome Browser
Genomic annotations are presented in the browser as Tracks
FGENESH
HMM-based gene structure prediction
Combiners
Intrinsic and extrinsic methods are used in combination
exon/intron borders are hard to predict
It is difficult for intrinsic (ab initio) gene-finding algorithms to predict protein-coding genes in eukaryotic genomic DNA because ________.
Who created Perl?
Larry Wall
Lexical variables
Private variables
The Gene Ontology (GO)
Providing controlled vocabularies for describing gene products in the domain of molecular biology
SQL
SELECT Attribute(s) FROM Table(s) WHERE Condition(s)
Major Problems of the Protein Structure Prediction
Secondary Structure Prediction, Fold Recognition, Homology Modeling
Database Schema
The logical structure, or the expression of the inter-relationships among the data
GBrowse
a Web-based application for displaying genomic annotations and other features
the unless control structure
a block of code is executed only when the condition is false
Attribute
a column (header) in a relation (table)
hash table
a data structure that can look up values by names called keys
Pfam
a database of common protein domains (multiple sequence alignments and HMMs)
unshift() function
adds new elements to the beginning of an array; reverse operation of shift() function
push() function
adds new elements to the end of an array; reverse operation of pop() function
Operator Precedence
determines which operations in a complex group of operations happen first (Perl follows math rules)
Structural Genomics
focuses on sequencing genomes and analyzing nucleotide sequences to identify genes and other important sequences such as gene regulatory elements
How much of human genomic sequence is identified and masked by RepeatMasker?
over 56%
The chomp() function
removes a trailing newline from the end of a string
Gene Prediction
refers to the process of identifying the regions of genomic DNA that encode genes
Metabolome
refers to the sum total of all the low-molecular-weight metabolites produced by an organism, cell or tissue
chop() function
removes the last character of a string
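A short sketch contrasting chomp() and chop() on the same string. The variable names are illustrative.

```perl
use strict;
use warnings;

my $line = "ATGC\n";

chomp($line);              # removes the trailing newline: "ATGC"
my $after_chomp = $line;
chomp($line);              # no trailing newline left, so nothing changes

chop($line);               # removes the last character, whatever it is: "ATG"
my $after_chop = $line;

print "$after_chomp $after_chop\n";   # ATGC ATG
```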
Cross join operation
returns the Cartesian product of rows from tables in the join
What is Perl's only data type?
scalar
Intrinsic/ab initio methods
search for exons and introns based on signals and patterns in the genomic DNA
foreach loop
steps through a list of values, executing one iteration for each value
Relational Data Model
the conceptual basis of a relational database, which organizes data into relations (tables)
Print operator
takes a list of items and sends each as a string to the standard output (STDOUT)
sort() function
takes a list of values and sorts them in the internal character order (code point order for strings)
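Because the default order compares strings, numbers may not come out in numeric order. A minimal sketch with made-up values; the comparison block using <=> is the usual way to ask for a numeric sort.

```perl
use strict;
use warnings;

my @lengths = (100, 25, 9);

my @default_order = sort @lengths;                # compared as strings
my @numeric_order = sort { $a <=> $b } @lengths;  # compared as numbers

print "@default_order\n";   # 100 25 9
print "@numeric_order\n";   # 9 25 100
```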
print operator
takes a scalar argument and puts it out to the standard output
shift() function
takes the first element off of an array and returns the value
pop() function
takes the last element off of an array and returns the value
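The four array functions in this set work in pairs. A minimal sketch, reusing the student names from the array cards:

```perl
use strict;
use warnings;

my @students = ("Andrew", "Jennifer", "Patrick");

my $first = shift @students;   # "Andrew";  array is now (Jennifer, Patrick)
my $last  = pop @students;     # "Patrick"; array is now (Jennifer)
my $remaining = scalar @students;

unshift @students, $first;     # put the first element back on the front
push    @students, $last;      # put the last element back on the end

print "@students\n";   # Andrew Jennifer Patrick
```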
What does the first line in Perl do?
tells the operating system where Perl is installed
The ENCODE Project
to understand the function of the human genome
Accessing a key that does not exist in the hash table gives ______.
undef
keys() function
yields a list of all the keys in a hash table