Bioinformatics Midterm


An example of an expression that has the Boolean value of TRUE

! ""

If the array, @students= ("Andrew", "Jennifer", "Patrick"), what is the value of $students[0]?

"Andrew"

An example of a valid scalar variable in Perl

$Variable_10

Perl's favorite default variable

$_

hash table prefix

%

Functional signals

5′ splice site (donor), 3′ splice site (acceptor), translational start site and stop codon, etc.

Over ___% of the human genome is transcribed

75

The human genome has ___ million regulatory sites, spanning twice as much DNA as protein-coding genes

84

Repetitive sequences comprise ____% of the genome

>50

UniGene

A secondary database that provides a non-redundant set of gene transcripts for an organism

Hidden Markov Models (HMMs)

A class of probabilistic models that are generally applicable to time series or linear sequences

Archival (Primary) Databases

- Contain experimental data
- Not modified by curators
- Could be highly redundant
- For example, databases of DNA/protein sequences and structures (GenBank, Protein Data Bank, etc.)

Derived (Secondary) Databases

- Contain information derived from the archival databases, and inferred from analysis of the contents
- Often annotated by experts and curators
- May contain computationally derived results
- For example, databases of sequence motifs and scientific publications (Pfam, PubMed, etc.)

Sequence Identifiers of FASTA format

- GenBank accession.version: e.g., NM_139058.1
- GI (GeneInfo) numbers: used mainly by NCBI
- Other database accession numbers

Boolean true/false rules

- If the value is a number, 0 means false; all other numbers mean true
- If the value is a string, the empty string ('') and the string '0' mean false; all other strings mean true
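
The rules above can be checked directly. This is a minimal sketch; the helper name truth_of is hypothetical and only wraps Perl's own Boolean test.

```perl
use strict;
use warnings;

# Hypothetical helper: report how Perl treats a value in Boolean context.
sub truth_of {
    my ($value) = @_;
    return $value ? "true" : "false";
}

my @results;
foreach my $value (0, 1, -5, '', '0', '00', 'hello') {
    push @results, truth_of($value);
}

# '00' is a non-empty string other than '0', so it counts as true.
print "@results\n";   # false true true false false true true
```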

Perl's 6 special filehandle names

- STDIN
- STDOUT
- STDERR
- DATA
- ARGV
- ARGVOUT

Framework for Ab Initio Gene Prediction

- The initial (5′) exon is preceded by a core promoter with sequence elements such as the TATA box
- Internal exons are free of in-frame stop codons, and are delimited by splice signals such as 5′ GT and 3′ AG
- The final (3′) exon often contains a stop codon, followed by a polyadenylation signal
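
As an illustration of the "free of in-frame stop codons" criterion, the sketch below scans a candidate exon in one reading frame. The subroutine name and the test sequence are hypothetical; real gene finders combine this kind of check with probabilistic models.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the sequence contains a stop codon
# (TAA, TAG or TGA) in the given reading frame (0, 1 or 2), else 0.
sub has_in_frame_stop {
    my ($exon, $frame) = @_;
    for (my $i = $frame; $i + 3 <= length($exon); $i += 3) {
        my $codon = uc(substr($exon, $i, 3));
        return 1 if $codon eq 'TAA' || $codon eq 'TAG' || $codon eq 'TGA';
    }
    return 0;
}

my $candidate = "ATGTAAGGC";                       # made-up sequence
my $frame0 = has_in_frame_stop($candidate, 0);     # codons ATG, TAA, GGC
my $frame1 = has_in_frame_stop($candidate, 1);     # codons TGT, AAG
print "frame 0: $frame0, frame 1: $frame1\n";      # frame 0: 1, frame 1: 0
```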

array

- A variable that contains a list of data elements
- Each element holds a scalar value
- Elements are numbered using sequential integers starting at zero and increasing by one for each element
- Use the at sign (@) before the name to refer to the entire list
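
A short sketch of array indexing, reusing the @students list from the earlier card:

```perl
use strict;
use warnings;

my @students = ("Andrew", "Jennifer", "Patrick");

my $first = $students[0];            # indices start at zero
my $last  = $students[$#students];   # $#students is the index of the last element
my $count = scalar @students;        # an array in scalar context gives its length

print "$first, $last, $count\n";     # Andrew, Patrick, 3
```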

Perl

- A very high-level language. It is easy, mostly fast, but kind of ugly
- Optimized for problems that are about 90% working with text and about 10% everything else
- Especially good for quick-and-dirty solutions

JIGSAW

- An integrative gene prediction program that combines the outputs from other gene finders, splice site predictors and sequence alignments
- Provides an automated way to take advantage of the many successful gene prediction methods, and can provide significant improvements in accuracy over an individual method
- One of the best-performing programs in the ENCODE Genome Annotation Assessment Project (EGASP) competition

Tandem repeats (satellite DNA)

- DNA sequences with multiple copies arranged next to each other
- Certain tandem repeats play structural roles in centromeres and telomeres

AUGUSTUS

- Based on a generalized HMM with a new method for modeling intron length distributions
- Can be used for ab initio gene prediction, and has a flexible mechanism for incorporating extrinsic information, such as EST and protein alignments
- One of the most accurate programs for ab initio gene prediction

BLAST

- By far the most widely used method for sequence similarity analysis
- Sequence similarity searches can identify homologous genes that are evolutionarily related in other organisms
- An important tool for genome annotation and comparative genomics

Why use a hash table?

- Can be more condensed and easier to understand than arrays
- Hashes versus arrays:
  %student_grades = ("Andrew", 92, "Jennifer", 98, "Patrick", 86);
  versus
  @students = ("Andrew", "Jennifer", "Patrick");
  @grades = (92, 98, 86);
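
To make the comparison concrete, the sketch below looks up one student's grade both ways, using the same data as the card. With a hash the lookup is a single expression; with parallel arrays you have to search for the matching index first.

```perl
use strict;
use warnings;

# Hash: the name is the key, the grade is the value.
my %student_grades = ("Andrew", 92, "Jennifer", 98, "Patrick", 86);
my $grade_from_hash = $student_grades{"Jennifer"};

# Parallel arrays: find the index of the name, then use it in @grades.
my @students = ("Andrew", "Jennifer", "Patrick");
my @grades   = (92, 98, 86);
my $grade_from_arrays;
for my $i (0 .. $#students) {
    $grade_from_arrays = $grades[$i] if $students[$i] eq "Jennifer";
}

print "$grade_from_hash $grade_from_arrays\n";   # 98 98
```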

<STDIN>

- Can be used to read from the standard input stream
- Each time it is used, Perl reads the next text line from the standard input
- The line-input operator returns undef if the end of the input is reached
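
The line-input operator works the same way on any filehandle. So that the sketch runs without typing at a terminal, it reads from an in-memory filehandle (an assumption made for the example); replacing <$in> with <STDIN> would read from standard input instead.

```perl
use strict;
use warnings;

# Stand-in for standard input: a filehandle opened on a string.
my $text = "line one\nline two\n";
open(my $in, '<', \$text) or die "cannot open in-memory file: $!";

my @lines;
# The loop ends when the line-input operator returns undef at end of input.
while (defined(my $line = <$in>)) {
    chomp $line;              # drop the trailing newline
    push @lines, $line;
}
close $in;

print scalar(@lines), " lines read\n";   # 2 lines read
```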

Prokaryotic Genomes

- Considerably less DNA and fewer genes
- Coding capacity: compact and continuous genes
- Genes organized into operons
- Often contain plasmids, which are usually small and circular DNA with additional genes

Scalar Variable Names

- Consist of a dollar sign ($) and an identifier
- Examples: $name ; $Name ; $_n2 ; $n_2

Repeat Masker

- Has been the most widely used tool for characterizing repetitive DNA
- Can be used to identify SINE, LINE, LTR and DNA transposons as well as several other categories of repetitive DNA (simple repeats, low-complexity DNA, and satellite DNA)
- Searches a DNA sequence query against curated libraries of repeats

Eukaryotic Genomes

- Substantially larger genome size with more complex organization
- Enormous protein-coding capacity, but the majority of DNA does not code for proteins
- Protein-coding sequences (exons) can be interrupted by noncoding introns, which are removed by splicing from the primary RNA transcript
- Alternative splicing allows for various combinations of exons to be joined to form different mRNAs, which produce more than one polypeptide from a gene

Bioinformatics

The field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned

the if control structure

The statements within the curly braces are executed only if the condition returns a true value

T/F: Hidden Markov Models (HMMs) are a class of probabilistic models that have been widely used in gene prediction programs, including GENSCAN, FGENESH, and AUGUSTUS.

True

T/F: In a Perl program, a filehandle is the name of an I/O connection, not necessarily the name of a file.

True

T/F: In a relational database, a foreign key, which is an attribute in one table and the primary key of another table, is used to establish a link between the data in the two tables

True

FGENESH++

An automatic genome annotation pipeline, which applies FGENESH+

FGENESH_C

FGENESH variant that incorporates cDNA/EST sequences

FGENESH+

FGENESH variant using homologous proteins for accurate assembly of predicted exons

FGENESH-2

FGENESH variant using sequences of two related genomes (e.g., human and mouse)

T/F: About 10% of the human genome consists of repetitive elements of various kinds, which can be identified using the software program RepeatMasker.

False

T/F: In a FASTA file with multiple sequences, each sequence record should have a definition line that always starts with the number sign (#).

False

T/F: In the Structured Query Language (SQL), a standard SELECT statement can include a WHERE clause to indicate the table(s) to retrieve data and an IF clause to specify the condition for restricting the rows returned by the query

False

SOAPdenovo

Developed by BGI for de novo assembly of a large genome (human-sized) from short reads (e.g., Illumina reads)

What type of strings does Perl interpret?

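Double-quoted strings: variables are interpolated and backslash escapes such as \n are interpreted (e.g., $fred = "world!\n"); single-quoted strings are not interpolated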

Key of a Relation

Each row has a value of an attribute (or a set of attributes) that uniquely identifies the row in the table

Primary Key

If a relation has several candidate keys, one is chosen as the primary key

Database Development Process

Requirement Analysis, Data Modeling, Database Construction, Application Development

String Concatenation in Perl

The . operator can be used to concatenate, or join, string values
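
A brief sketch of concatenation, including the related append operator (.=); the variable names and strings are made up for the example.

```perl
use strict;
use warnings;

my $greeting = "hello" . " " . "world";   # join three strings
my $dna = "ATG";
$dna .= "TAA";                            # .= appends to the existing value

print $greeting . "\n";   # hello world
print $dna . "\n";        # ATGTAA
```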

3 genome browsers developed to facilitate human genome annotation

NCBI Map Viewer, UCSC Genome Browser, and Ensembl

Molecular Function, Biological Process, Cellular Component

The Gene Ontology (GO) provides controlled vocabularies for gene functional annotation, and consists of unlinked hierarchies, including ________.

Computational Biology

The analysis and interpretation of various types of biological data, including nucleotide and amino acid sequences, protein domains, protein structures, etc.

PubMed

The best database to search and retrieve biomedical literature information

Functional Annotation by Sequence Similarities

The common practice to transfer functional annotation from a previously annotated protein with significant sequence similarity relies on two assumptions: 1. Proteins with similar sequences have similar functions 2. The previous annotation is correct

TrEMBL

The data resources provided by Uniprot include ________.

Database and software development

The development and implementation of tools that enable efficient access and management of different types of biological information

Theoretical Bioinformatics

The development of new algorithms and statistics with which to assess relationships among members of large data sets

Repbase

a database of prototypic sequences representing repetitive DNA from different eukaryotic species

PAX-6

a master regulatory gene controlling eye development

Protein Folding

a nascent polypeptide acquires a highly organized structure

What do most Perl statements end with?

a semicolon (;)

an ontology

a set of terms, relationships and definitions, which capture the knowledge of a certain domain

TWINSCAN/N-SCAN

a system that extends the GENSCAN model by exploiting comparison of related genomes (e.g., human and mouse).

In a relational database, a relation is ______.

a table of values

Entrez

a text-based search and retrieval system for the federated databases at NCBI; connects DB entries with neighboring and hard links

GENSCAN

an HMM-based program using many higher-order properties of genomic sequences such as gene density, exon size distribution, etc.

In Perl, what do comments run from?

comments run from a pound sign (#) to the end of the line

Transcriptome

comprises all the RNA transcripts synthesized by an organism

Foreign Key

an attribute in one table, which is the primary key of another table

GENOMESCAN

an extension of GENSCAN that incorporates sequence similarity to known proteins.

Data Model

an integrated collection of concepts for describing data, relationships between data, and constraints on the data

DFAM

contains a collection of sequence alignments and hidden Markov models (HMMs) of transposable elements and other repetitive DNA elements

NCBI GenBank

contains an annotated collection of all publicly available DNA sequences

Interspersed Repeats

are repetitive sequences that are scattered around the genome

Perl Scalar Assignment

assignment operator is the equal sign (=)

Scalar Variable identifier

begins with a letter or underscore (_), followed possibly by more letters, or digits, or underscores

each() function

can be used to iterate over an entire hash table and return a key-value pair as a two-element list
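
A minimal sketch of each() in a while loop, using the grades data from the hash card. The order in which pairs come back is not predictable, so the sketch collects them rather than relying on order.

```perl
use strict;
use warnings;

my %student_grades = ("Andrew", 92, "Jennifer", 98, "Patrick", 86);

my %seen;
# each() returns the next (key, value) pair, or an empty list when done.
while (my ($name, $grade) = each %student_grades) {
    print "$name: $grade\n";
    $seen{$name} = $grade;
}
```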

DATA filehandle

can be used to read the lines of input data appearing after the __END__ (or __DATA__) token in the program file

Subroutine return values

the return operator can be used to return a value from a subroutine; without it, the subroutine returns the value of the last expression evaluated

the until control structure

can be used to reverse the condition of a while loop
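
The two loops in this sketch do the same work: until runs while its condition is false, so it is equivalent to a while loop with the condition negated.

```perl
use strict;
use warnings;

my $count = 0;
until ($count >= 3) {        # keep looping until the condition becomes true
    $count++;
}

my $n = 0;
while (!($n >= 3)) {         # the same loop written with while
    $n++;
}

print "$count $n\n";         # 3 3
```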

the defined() function

can be used to tell if a variable has the undef value (different from the empty string '')
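
A small sketch of the difference between undef and the empty string; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $empty = '';      # defined, but false in Boolean context
my $nothing;         # never assigned, so its value is undef

my $empty_status   = defined($empty)   ? "defined" : "undef";
my $nothing_status = defined($nothing) ? "defined" : "undef";

print "$empty_status $nothing_status\n";   # defined undef
```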

the else keyword

can provide an alternative choice

Genome

contains the full set of genes, which determine the primary structures of gene products (polypeptides and RNAs)

Transcription

conversion of genetic information from DNA to RNA

OMIM

founded by Victor McKusick at the Johns Hopkins University, is the electronic version of the catalog of human genes and genetic disorders

unary not operator (!)

get the opposite of any Boolean value

values() function

yields a list of all the values in a hash table, in the same order as the keys returned by keys()

Protein Data Bank (PDB)

has been the primary repository for protein structures

Scalar Variable

holds exactly one scalar value

Orthologues

homologous genes from different species that are thought to have descended from a common ancestor

Paralogues

homologous genes in the same species

Position-Specific Iterated BLAST (PSI-BLAST)

improves database search sensitivity (detection of true positives) and selectivity (rejection of false positives)

Genetic Code

information carried in the DNA specifies the protein end product

filehandle

is the name of an I/O connection (not the name of a file) for a Perl program; often named in all uppercase letters

Genome Annotation

is the process of identifying genes, their regulatory sequences, and their functions

The probability P of a state path

is the product of all the emission and transition probabilities along the path
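
The product can be worked through on a toy model. Everything in this sketch is an assumption made for illustration: the two states (E for exon, I for intron), the begin state, and all probability values are invented and do not come from any real gene finder.

```perl
use strict;
use warnings;

# Hypothetical transition probabilities, including a silent begin state.
my %transition = (
    begin => { E => 0.5, I => 0.5 },
    E     => { E => 0.9, I => 0.1 },
    I     => { E => 0.2, I => 0.8 },
);

# Hypothetical emission probabilities for each state.
my %emission = (
    E => { A => 0.2, C => 0.3, G => 0.3, T => 0.2 },
    I => { A => 0.3, C => 0.2, G => 0.2, T => 0.3 },
);

# Multiply one transition and one emission probability per position.
sub path_probability {
    my ($sequence, $path) = @_;
    my @symbols = split //, $sequence;
    my @states  = split //, $path;
    my $p = 1;
    my $previous = 'begin';
    for my $i (0 .. $#symbols) {
        my $state = $states[$i];
        $p *= $transition{$previous}{$state} * $emission{$state}{$symbols[$i]};
        $previous = $state;
    }
    return $p;
}

# (0.5 * 0.3) * (0.9 * 0.3) * (0.1 * 0.3) = 0.001215
my $p = path_probability("GCA", "EEI");
printf "%.6f\n", $p;
```

For long sequences the product becomes very small, so practical implementations usually sum log probabilities instead.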

What type of relationship(s) between genes and GO terms?

many-to-many relationship

the while loop

repeats a block of code as long as the condition is true

Definition Line of FASTA format

should start with a greater than character (>), which is usually followed by a unique sequence identifier and a short description
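
A sketch of splitting a definition line into its identifier and description with a regular expression. The accession is the example used earlier in this set; the description text is a hypothetical placeholder.

```perl
use strict;
use warnings;

my $defline = ">NM_139058.1 hypothetical example description";

my ($id, $description);
# '>' must be the first character; the identifier runs up to the first space.
if ($defline =~ /^>(\S+)\s*(.*)$/) {
    ($id, $description) = ($1, $2);
}

print "id: $id\n";
print "description: $description\n";
```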

DNA Sequence Alignment

similar sequences of bases are lined up for comparison

Proteome

the entire set of proteins translated

Translation

the information from an mRNA is translated to the amino acid sequence of the protein

The Human Genome

the majority (>80%) of the human genome is transcribed

N50 value

the length of the shortest contig in the set of longest contigs that together contain at least half the bases in a given assembly
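
One common way to compute N50 is to sort contigs from longest to shortest and add up lengths until half the assembly is covered. The sketch below follows that approach; the subroutine name and the contig lengths are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: N50 of a list of contig lengths.
sub n50 {
    my @lengths = sort { $b <=> $a } @_;   # longest first
    my $total = 0;
    $total += $_ for @lengths;

    my $running = 0;
    for my $length (@lengths) {
        $running += $length;
        # Stop at the contig that brings the running sum to half the total.
        return $length if $running * 2 >= $total;
    }
    return;   # no contigs given
}

# Total is 290; 80 + 70 = 150 covers at least half, so N50 is 70.
my $n50 = n50(80, 70, 50, 40, 30, 20);
print "N50 = $n50\n";   # N50 = 70
```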

Whole-genome shotgun sequencing

the most widely used strategy for sequencing and assembling an entire genome

Genomics

the study of genomes, applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble and analyze genomes

Global variables

they are accessible from every part of the program

To impose some discipline in Perl programming, put the _______ pragma at the top of your program

use strict

To define a subroutine

use the keyword sub, the subroutine name (another Perl identifier), and the block of code in curly braces

To invoke a subroutine

use the subroutine name with the ampersand (&) in front
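
A sketch that defines a subroutine and invokes it with the ampersand form. The subroutine gc_content is a hypothetical example; in modern Perl the call also works without the ampersand when parentheses are used.

```perl
use strict;
use warnings;

# Define: keyword sub, a name, and a block in curly braces.
sub gc_content {
    my ($dna) = @_;                      # arguments arrive in @_
    my $gc = ($dna =~ tr/GCgc//);        # tr with an empty replacement counts matches
    return $gc / length($dna);           # fraction of G and C bases
}

# Invoke: the subroutine name with an ampersand in front.
my $fraction = &gc_content("ATGC");
print "$fraction\n";   # 0.5
```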

DBMS

used to define, create, query, update, and administer databases

Subroutines

user-defined functions that let you reuse one chunk of code many times in a program. To invoke a subroutine, use the subroutine name with the ampersand (&) in front

The Critical Assessment of Structure Prediction (CASP) program

uses blind tests to judge the methods of protein structure prediction

Extrinsic methods

utilize external data such as ESTs and protein sequences

LINEs

(long interspersed nuclear elements, > 5 kb)

SINEs

(short interspersed nuclear elements, < 500 bp)

Protein-coding sequences often constitute only a small portion of a eukaryotic genome (_____%)

1-10

Ways to Search Biological Databases

1. Given a sequence, or a sequence fragment, find sequences in the database that are similar to it
2. Given a protein structure, find protein structures in the database that are similar to it
3. Given a sequence of a protein with unknown structure, find structures in the database that adopt similar three-dimensional structures
4. Given a protein structure, find sequences in the database that correspond to similar structures

3 unlinked hierarchies of the Gene Ontology (GO)

1. Molecular Function
2. Biological Process
3. Cellular Component

Static Webpage

1. Request a preexisting HTML file
2. Return the contents of the HTML file

Dynamic Webpage

1. Request service with parameters
2. Run a program using the parameters
3. Return program output (HTML)
4. Return the newly created HTML

The human genome encodes about ________ protein-coding genes, accounting for ~1.5% of the genome

23,000

Gene Prediction Software Programs

GENSCAN, FGENESH, Augustus, etc.

Central Dogma of Molecular Biology

The general flow of genetic information (DNA to RNA to protein) in all organisms; the exception is some viruses with RNA genomes, which use reverse transcriptase to make DNA

UCSC Genome Browser

Genomic annotations are presented in the browser as Tracks

FGENESH

HMM-based gene structure prediction

Combiners

Intrinsic and extrinsic methods are used in combination

exon/intron borders are hard to predict

It is difficult for intrinsic (ab initio) gene-finding algorithms to predict protein-coding genes in eukaryotic genomic DNA because ________.

Who created Perl?

Larry Wall

Lexical variables

Private variables

The Gene Ontology (GO)

Providing controlled vocabularies for describing gene products in the domain of molecular biology

SQL

SELECT Attribute(s) FROM Table(s) WHERE Condition(s)

Major Problems of the Protein Structure Prediction

Secondary Structure Prediction, Fold Recognition, Homology Modeling

Database Schema

The logical structure, or the expression of the inter-relationships among the data

GBrowse

a Web-based application for displaying genomic annotations and other features

the unless control structure

a block of code is executed only when the condition is false
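
A short sketch of unless; the subroutine name and sequences are hypothetical. The block runs only when the pattern match fails.

```perl
use strict;
use warnings;

sub check_start {
    my ($sequence) = @_;
    my $message = "starts with ATG";
    # Executed only when the condition is false.
    unless ($sequence =~ /^ATG/) {
        $message = "no start codon";
    }
    return $message;
}

my $with_start    = check_start("ATGCCC");
my $without_start = check_start("CCCATG");
print "$with_start; $without_start\n";   # starts with ATG; no start codon
```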

Attribute

a column (header) in a relation (table)

hash table

a data structure that can look up values by names called keys

Pfam

a database of common protein domains (multiple sequence alignments and HMMs)

unshift() function

adds new elements to the beginning of an array; reverse operation of shift() function

push() function

adds new elements to the end of an array; reverse operation of pop() function

Operator Precedence

determines which operations in a complex group of operations happen first (Perl follows math rules)

Structural Genomics

focuses on sequencing genomes and analyzing nucleotide sequences to identify genes and other important sequences such as gene regulatory elements

How much of human genomic sequence is identified and masked by RepeatMasker?

over 56%

The chomp() function

removes a trailing newline from the end of a string

Gene Prediction

refers to the process of identifying the regions of genomic DNA that encode genes

Metabolome

refers to the sum total of all the low-molecular-weight metabolites produced by an organism, cell or tissue

chop() function

removes the last character of a string
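
The sketch below contrasts chop() with chomp(): chomp() removes only a trailing newline, while chop() removes whatever the last character is.

```perl
use strict;
use warnings;

my $line = "ATGC\n";
chomp $line;          # removes the trailing newline
chomp $line;          # no newline left, so nothing changes

my $chopped = "ATGC";
chop $chopped;        # removes the last character, here the C

print "$line $chopped\n";   # ATGC ATG
```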

Cross join operation

returns the Cartesian product of rows from tables in the join
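
The Cartesian product can be pictured as two nested loops that pair every row of one table with every row of the other. This sketch uses plain Perl arrays as stand-ins for tables; the table contents are hypothetical.

```perl
use strict;
use warnings;

my @genes   = ("geneA", "geneB");             # hypothetical table with 2 rows
my @tissues = ("liver", "brain", "heart");    # hypothetical table with 3 rows

my @product;
for my $gene (@genes) {
    for my $tissue (@tissues) {
        push @product, [$gene, $tissue];      # one combined row per pairing
    }
}

# 2 rows x 3 rows = 6 combined rows
print scalar(@product), " rows\n";
print join(", ", map { "$_->[0]/$_->[1]" } @product), "\n";
```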

What is Perl's only data type?

scalar

Intrinsic/ab initio methods

search for exons and introns based on signals and patterns in the genomic DNA

foreach loop

steps through a list of values, executing one iteration for each value
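
A minimal foreach sketch; the codon list is made up for the example.

```perl
use strict;
use warnings;

my @codons = ("ATG", "GCC", "TAA");
my $joined = "";

# One iteration per element; $codon holds the current value.
foreach my $codon (@codons) {
    $joined .= lc($codon);
}

print "$joined\n";   # atggcctaa
```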

Relational Data Model

the conceptual basis of a relational database, which organizes data into relations (tables)

Print operator

takes a list of items and sends each as a string to the standard output (STDOUT)

sort() function

takes a list of values and sorts them in the internal character order (code point order for strings)
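
Because the default order is by character, numbers do not sort numerically unless a comparison block is supplied. The sketch below shows both; the numbers are arbitrary.

```perl
use strict;
use warnings;

my @numbers = (10, 9, 100, 2);

my @as_strings = sort @numbers;                 # default: character order
my @as_numbers = sort { $a <=> $b } @numbers;   # numeric order with <=>

print "@as_strings\n";   # 10 100 2 9
print "@as_numbers\n";   # 2 9 10 100
```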

print operator

takes a scalar argument and puts it out to the standard output

shift() function

takes the first element off of an array and returns the value

pop() function

takes the last element off of an array and returns the value
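
The four array functions work in pairs: push and pop act on the end of an array, unshift and shift on the beginning. A sketch using the student names from the earlier cards:

```perl
use strict;
use warnings;

my @students = ("Jennifer");

push    @students, "Patrick";   # ("Jennifer", "Patrick")
unshift @students, "Andrew";    # ("Andrew", "Jennifer", "Patrick")

my $last  = pop   @students;    # removes and returns "Patrick"
my $first = shift @students;    # removes and returns "Andrew"

print "$first $last @students\n";   # Andrew Patrick Jennifer
```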

What does the first line in Perl do?

the shebang line (for example, #!/usr/bin/perl) tells the operating system where the Perl interpreter is installed

The ENCODE Project

to understand the function of the human genome

Accessing a key that does not exist in the hash table gives ______.

undef
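
A sketch of looking up a key that is not in the hash. The key "Zoe" is a hypothetical name that is deliberately absent; exists() tells a missing key apart from a key whose value happens to be undef.

```perl
use strict;
use warnings;

my %student_grades = ("Andrew", 92, "Jennifer", 98, "Patrick", 86);

my $missing    = $student_grades{"Zoe"};                     # undef
my $is_present = exists $student_grades{"Zoe"} ? "yes" : "no";

print "Zoe in hash: $is_present\n";   # Zoe in hash: no
```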

keys() function

yields a list of all the keys in a hash table
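
A sketch of keys() together with values(). Hash order is not predictable, so both lists are sorted here to give stable output.

```perl
use strict;
use warnings;

my %student_grades = ("Andrew", 92, "Jennifer", 98, "Patrick", 86);

my @names  = sort keys %student_grades;                  # alphabetical
my @grades = sort { $a <=> $b } values %student_grades;  # numeric

print "@names\n";    # Andrew Jennifer Patrick
print "@grades\n";   # 86 92 98
```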

