Bioinformatics Midterm

Entrez

"Coming in!" A text-based data retrieval system developed by the National Center for Biotechnology Information (NCBI) that provides integrated access to a wide range of data domains, including literature, nucleotide and protein sequences, complete genomes, three-dimensional structures, and more. Entrez connects database entries with neighboring and hard links.

Which of the following is a valid variable in Perl? @10_numbers $number-10 %MyHashTable STDIN

%MyHashTable. @10_numbers is invalid because a name can't start with a digit (@ marks an array/list of elements); $number-10 is invalid because hyphens (-) are not allowed ($ marks a scalar variable); STDIN is a filehandle, not a variable. A Perl variable needs a %, @, or $ sigil followed by a valid identifier.

What percentage of the human genome consists of repetitive elements of various kinds?

50%

If the hash table, %student_grades = ("Andrew", 90, "Jennifer", 80, "Patrick", 70), what is the value of $student_grades{"Jennifer"}?

80

Ontology

A set of terms, relationships, and definitions that captures the knowledge of a certain domain; terms are linked by relationships. Ex. GO (Gene Ontology), which provides a controlled vocabulary for describing gene products and consists of 3 unlinked hierarchies: molecular function, biological process, and cellular component. *Many-to-many relationship between genes and GO terms

Dotplot

A simple picture that gives an overview of the similarities between 2 sequences; it doesn't provide a robust statistical measure of alignment quality. -Dotplot diagrams can be used to find regions of local alignment, repeats, insertions, or deletions -For nucleotide sequences, a match scores +1 and a mismatch scores -1 -For protein sequences, PAM and BLOSUM matrices are used as scoring schemes to measure sequence similarity (BLOSUM62 is the default scoring matrix because it is effective at finding a broad range of potential similarities)
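
A minimal sketch of how a dot matrix can be built, assuming the simplest possible rule (one mark per identical residue pair, no window or threshold filtering). The function name `dotplot` is only illustrative; real dotplot tools usually add a sliding window to reduce noise.

```python
def dotplot(seq_a, seq_b):
    """Return the rows of a dot matrix: '*' where residues match, '.' otherwise.

    Row i corresponds to seq_a[i]; column j corresponds to seq_b[j].
    """
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


# Usage: similar regions show up as diagonal runs of '*'.
for row in dotplot("GATTACA", "GATCACA"):
    print(row)
```

Diagonal runs indicate locally similar regions, and a shifted diagonal hints at an insertion or deletion.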

Global vs Local Alignment Algorithms

-Global alignment algorithms compare 2 sequences along their entire length and are most applicable to highly similar sequences (e.g. Needleman-Wunsch, based on the math technique called dynamic programming) -Local alignment algorithms find the most similar regions in 2 sequences and are best for sequences that share only some degree of similarity (e.g. the Smith-Waterman algorithm, which is too slow for large database searches, so BLAST is used as a rapid, heuristic version of it)
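
The difference between the two approaches can be sketched with one dynamic-programming routine. This is an illustration only: the scoring values (match +1, mismatch -1, linear gap -2) and the function name `alignment_score` are assumptions, and the function returns just the optimal score, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: global score (Needleman-Wunsch style).
    local=True:  best local score (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)   # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


print(alignment_score("TTTACGTTT", "GGGACGGGG"))              # whole-length comparison
print(alignment_score("TTTACGTTT", "GGGACGGGG", local=True))  # best shared region only
```

The local variant differs in only two places: cells are never allowed to drop below zero, and the answer is the highest cell anywhere in the matrix rather than the bottom-right corner.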

Boolean Values

-if the value is a number, 0 means false and all other numbers mean true -if the value is a string, the empty string "" and '0' mean false and all other strings mean true -the not operator ! gives the opposite, Ex. 5 #true !5 #false. The Boolean operators in PubMed and Entrez searches, for ex., are AND, OR, NOT

\n \t \" \\

\n is a newline, \t is a tab, \" is a double quote (inside double-quoted strings), and \\ is a backslash

Bioinformatics

The field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

GO (gene ontology) results interpretation

The symbols + and - indicate over or underrepresentation of a term. P-value is the probability or chance of seeing at least x number of genes out of the total n genes in the list annotated to a particular GO term, given the proportion of genes in the whole genome that are annotated to that GO Term. The closer the p-value is to zero, the more significant the particular GO term associated with the group of genes is (i.e. the less likely the observed annotation of the particular GO term to a group of genes occurs by chance)
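
As a rough sketch of the p-value described above, one common choice (an assumption here; the card does not name the test a given GO tool uses) is the upper tail of the hypergeometric distribution. The function and parameter names are illustrative.

```python
from math import comb


def enrichment_p_value(x, n, term_genes, genome_genes):
    """P(at least x of the n listed genes carry the GO term).

    term_genes   = genes in the whole genome annotated to the term
    genome_genes = total genes in the genome
    Hypergeometric upper tail (sampling without replacement).
    """
    upper = min(n, term_genes)
    favourable = sum(
        comb(term_genes, k) * comb(genome_genes - term_genes, n - k)
        for k in range(x, upper + 1)
    )
    return favourable / comb(genome_genes, n)


# Usage with made-up numbers: 8 of 50 listed genes carry a term
# that annotates 200 of 20000 genes genome-wide.
print(enrichment_p_value(8, 50, 200, 20000))
```

A value close to zero means the term is over-represented in the list far more than its genome-wide proportion would predict.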

If you do a BLAST search using a protein query and the best match in the target database has an E-value of 3.5e-15, what does the result mean for the homology of the 2 matched sequences?

The target sequence is a strong CANDIDATE for a homologous sequence (not 100% certain, though). An E-value of 3.5e-15 means a match this good is very unlikely to occur by chance; the smaller the E-value, the lower the chance that the hit is a false positive

Database tier

This is how data storage is handled in the three-tier architecture of a database system

use strict Pragma in Perl

To impose some discipline in Perl programming, add the use strict pragma; Perl will then insist that you declare every new variable using the my operator: my ($m, $n);

Transcriptome Proteome Metabolome

Transcriptome = all RNA transcripts synthesized by an organism Proteome = entire set of proteins translated Metabolome = sum total of all the low-molecular-weight metabolites produced by an organism, cell, or tissue

Genetic Code

Unambiguous, 1 Start Codon (AUG) and 3 Stop codons (UAA, UAG, UGA), and Degenerate (multiple codons for most amino acids)
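
These properties can be checked directly against the standard codon table. The sketch below builds the table from the usual compact encoding (bases in U, C, A, G order); the names `CODON_TABLE` and `translate` are illustrative, not from the course material.

```python
from collections import Counter

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Unambiguous: a dict gives each codon exactly one meaning.
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}

# Degenerate: most amino acids are encoded by more than one codon.
CODONS_PER_AMINO_ACID = Counter(aa for aa in CODON_TABLE.values() if aa != "*")


def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(sorted(STOP_CODONS))
print(translate("AUGGCCUAA"))
```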

Which of the following is NOT one of the 3 genome browsers developed to facilitate human genome annotation? A) UCSC Genome Browser B) Ensembl C) NCBI Map Viewer D) UniProtKB

UniProtKB! It's the Universal Protein Resource, a central repository of protein data created by combining the Swiss-Prot, TrEMBL and PIR-PSD databases. Data types captured: protein annotation. It is a protein database, not a genome browser.

Database Management System (DBMS)

Used to define, create, query, update, and administer databases. -Relational data model = organizes data into relations aka tables (relationships can be one to one, one to many, many to one, or many to many) -Relation (table) = contains rows aka tuples; the data in each row corresponds to a real-world entity or relationship; each column (header) is called an attribute; a key is an attribute (or set of attributes) whose value uniquely identifies a row (the primary key is the chosen/main key, and a foreign key is an attribute in 1 table that is the primary key of a different table)

Foreign key

An attribute in one table that is the primary key of another table

Biological Databases (primary and secondary)

Archival (primary) databases contain experimental data, not modified by curators, can be highly redundant, Ex. GenBank, Protein Data Bank, etc. Derived (Secondary) databases contain info derived from primary and inferred from content analysis, often annotated by experts and curators, can contain computationally derived results, Ex. Pfam, PubMed, etc.

Secondary Databases

Ex. RefSeq or UniGene; an open access, annotated and curated collection of publicly available nucleotide sequences (DNA, RNA) and their protein products; non-redundant. This database is built by National Center for Biotechnology Information (NCBI), and, unlike GenBank, provides only a single record for each natural biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from viruses to bacteria to eukaryotes. For each model organism, RefSeq aims to provide separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts.

Hidden Markov Models (HMMs)

Class of probabilistic models (loaded dice example) that have been widely used in gene prediction programs, including GENSCAN, FGENESH, and AUGUSTUS - generally applicable to time series or linear sequences
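
The loaded dice example can be sketched as a two-state HMM decoded with the Viterbi algorithm. All probabilities below are made-up assumptions for illustration (a loaded die that rolls a 6 half the time, and arbitrary switching rates); they are not taken from the course or from any gene finder.

```python
from math import log

STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}                      # assumed
TRANS = {"fair": {"fair": 0.95, "loaded": 0.05},          # assumed
         "loaded": {"fair": 0.10, "loaded": 0.90}}


def emission(state, roll):
    """Probability of seeing `roll` (1-6) from the given hidden state."""
    if state == "fair":
        return 1 / 6
    return 0.5 if roll == 6 else 0.1


def viterbi(rolls):
    """Most probable hidden state path for a sequence of rolls (log space)."""
    best = {s: log(START[s]) + log(emission(s, rolls[0])) for s in STATES}
    back = []
    for roll in rolls[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            # Pick the predecessor state that maximises the path probability.
            source = max(STATES, key=lambda p: prev[p] + log(TRANS[p][s]))
            best[s] = prev[source] + log(TRANS[source][s]) + log(emission(s, roll))
            pointers[s] = source
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi([1, 3, 2, 6, 6, 6, 6, 6, 6, 4, 2]))
```

Gene finders apply the same idea with hidden states such as exon, intron, and intergenic region, and with nucleotides in place of die rolls.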

Computer Networks and the internet

Computer networks = set of computers that are connected and able to exchange data Internet = vast collection of interconnected networks Unifying principle = use the same protocol (TCP/IP) for network interconnection *Can locate a computer on the internet via its IP address, Domain name, or URL

Perl

Created by Larry Wall in the 1980s (first released in 1987); optimized for problems that are about 90% text and 10% everything else. Type "perl my_program.pl" at a command prompt (e.g. MS-DOS) to run a Perl program

CASP program

Critical Assessment of Structure Prediction (CASP) uses blind tests to judge the methods of protein structure prediction (structural biologists release the amino acid sequence months before the publication date of the structure so predictors can submit models to CASP)

Federated databases

Current data, flexible architecture, and no data consolidation; but slower queries and data at the sources. EX - Entrez and IBM DiscoveryLink

Central Dogma of Molecular Biology

DNA --> Transcription --> RNA --> Translation --> Protein (in all organisms except some viruses with RNA genomes and using reverse transcriptase to make DNA)

Sequence File Format - FASTA

Definition line starts with a greater than sign > followed by a unique sequence identifier and short description (>NM_139058.1 Homo sapiens aristaless related homeobox)
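
A minimal sketch of reading this format, assuming plain multi-record FASTA text and that every definition line carries an identifier. The function name `parse_fasta` and the sequence letters in the usage example are placeholders, not the real NM_139058.1 sequence.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: {"description": ..., "sequence": ...}}."""
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Definition line: identifier, then an optional free-text description.
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            records[identifier] = {"description": description, "sequence": ""}
        elif identifier is not None:
            # Sequence lines are concatenated until the next '>' line.
            records[identifier]["sequence"] += line
    return records


example = """>NM_139058.1 Homo sapiens aristaless related homeobox
ATGC
GGTA
"""
print(parse_fasta(example))
```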

UniGene

Derived from mRNA information

Database and software development

Development and implementation of tools that enable efficient access and management of different types of biological information

Which of the following provides an estimate of the number of false positives from a BLAST search? Bit Score, E-Value, Percent Identities, or Percent Positives

E-Value!
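
For intuition, the E-value can be related to the bit score through the commonly cited approximation E = m * n * 2^(-S'), where S' is the bit score and m and n are the (effective) query and database lengths. The sketch below is an assumption-laden illustration of that relationship, not BLAST's exact calculation, and the function name is hypothetical.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value: expected number of chance hits scoring this well."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Usage with made-up sizes: the same bit score is less impressive
# in a larger database because there are more chances for a random hit.
print(expected_chance_hits(50, 300, 1_000_000))
print(expected_chance_hits(50, 300, 1_000_000_000))
```

This is why the E-value, unlike the bit score or percent identity, depends on the size of the search space.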

Primary Databases

Ex. GenBank; repositories for nucleotide sequence data from all organisms. Primary database because it houses original sequence data. Normally has high redundancy which is an issue with them.

Prokaryotic genomes

Less DNA and fewer genes than eukaryotic genomes; coding capacity is compact, with continuous (uninterrupted) genes; genes are organized into operons; prokaryotes often contain plasmids

Standard WHERE clause

Specifies the condition for restricting the rows returned by the query

SQL

Structured Query Language - contains SELECT statements, FROM clauses, and WHERE clauses, among other things

Intrinsic (ab initio) gene-finding algorithms

Struggle in predicting protein-coding genes in eukaryotic genomic DNA because exon/intron borders are hard to predict; AUGUSTUS is one of the most accurate programs for ab initio gene prediction

The ENCODE Project

The Encyclopedia of DNA Elements (ENCODE) project aims to understand the function of the human genome; almost 500 scientists from 32 research groups took part. Findings: over 75% of the genome is transcribed; it has 8.4 million regulatory sites (twice as much DNA as protein-coding genes) and 23,000 protein-coding genes (about 1.5% of the genome)

GenBank

NCBI GenBank contains an annotated collection of all publicly available DNA sequences (Ex. Could type PAX-6 gene into search and the nucleotide sequence would come up; accession version = NM_123456.7)

Filehandles in Perl

A filehandle is the name of an I/O connection for a Perl program; the names are normally in all uppercase letters. <STDIN> reads from the standard input stream; __END__ indicates the Perl code is finished and what follows is input data; INFILE is a typical name for a filehandle opened for input; OUTFILE is a typical name for a filehandle opened for output

NCBI

National Center for Biotechnology Information - creates public databases for the biological community

If you want to access text information about human diseases, what is the best database to visit? OMIM, PubMed, UniGene, or UniProtKB

OMIM! (UniGene is used to check expression... and UniProtKB is a major source of high-quality protein sets, made of 2 parts: Swiss-Prot and TrEMBL)

Protein Data Bank (PDB)

The RCSB PDB has been the primary repository for protein structures - contains 125,795 structures so far

UniProtKB

The Universal Protein Resource! A central repository of protein data created by combining the Swiss-Prot, TrEMBL and PIR-PSD databases. Data types captured: protein annotation. Data resources include Swiss-Prot, TrEMBL, PIR (Protein Information Resource), and Proteomes.

PubMed

The best database for retrieving biomedical literature information; it's the literature component of Entrez at NCBI

chomp()

The chomp() function removes a trailing newline from the end of a string; if there is no trailing newline, chomp() does nothing

P-Value

The closer the p-value is to zero, the more significant the particular result is (i.e. the less likely it is that the observed result occurred by chance)

Hash table

a hash table is a data structure that can look up values by keys; refer to the entire table by using the % sign. the hash table holds key-value pairs and hash keys are unique strings; better than arrays! Ex. (arrays) @students = ("Tim", "Jim"); @grades = ( 90, 95); (hash table) %student_grades = ("Tim", 90, "Jim", 95);

BLASTN can be used to search:

a nucleotide query against a nucleotide sequence database

BLASTP can be used to search:

a protein query against a protein sequence database

TBLASTN can be used to search:

a protein query against a translated nucleotide sequence database - (used to translate every DNA sequence in a database into 6 potential proteins, and then to compare your protein query against each of those translated proteins)

BLASTX can be used to search:

a translated nucleotide query against a protein sequence database - (translates a DNA sequence into 6 protein sequences using all 6 possible reading frames and then compares each of these proteins to a protein database)
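
The six-frame translation step can be sketched as follows. This only illustrates the translation that BLASTX performs conceptually before the protein comparison; it is not BLAST's implementation, and the function names and frame labels (+1 to +3, -1 to -3) are illustrative conventions.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate from `offset` (0, 1 or 2); unknown codons become 'X'."""
    return "".join(
        CODONS.get(dna[i:i + 3], "X") for i in range(offset, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate all 3 forward and 3 reverse-strand reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate_frame(seq, offset)
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six protein strings would then be compared against the protein database; stop codons appear as `*`.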

Arrays

an array is a variable that contains a list of data elements, and each element holds a scalar value (indexes start at zero and increase by 1 for each element); uses the @ sign to refer to the entire array: $students[0] = "Andrew"; $students[1] = "Jennifer"; $students[2] = "Patrick"; @students = ("Andrew", "Jennifer", "Patrick");

JIGSAW

an integrative gene prediction program that combines the outputs from other gene finders, splice site predictors and sequence alignments.

Computational biology

analysis and interpretation of various types of biological data, including nucleotide and amino acid sequences, protein domains, protein structures, etc.

Standard SELECT statement

can include a FROM clause to indicate the tables to retrieve data from

Whole genome shotgun sequencing

it has become the most commonly used strategy for sequencing and assembling an entire genome; many software programs are available for assembling shotgun reads into contigs.

Eukaryotic genome

larger, contains introns/exons, only 2-10% of genome is protein-coding sequences and repetitive sequences comprise >50% of genome. Gene regulatory sequences (promoters and enhancers), pseudogenes (dysfunctional gene copies with significant mutations, usually not transcribed), and noncoding RNA genes (rRNAs, tRNAs, microRNAs, etc.)

TBLASTX can be used to search:

most computationally intensive because it translates DNA from both query and a database into 6 potential proteins, then performs 36 protein-protein database searches

RepeatMasker

most widely used tool for characterizing repetitive DNA; can identify SINE, LINE, LTR, and DNA transposons (it searches a DNA sequence query against curated libraries of repeats such as Repbase (database of prototypic sequences of repetitive DNA from different eukaryotes) and Dfam (collections of sequence alignments and Hidden Markov Models/HMMs of transposable elements))

print operator

the print operator takes a scalar argument and puts it out to standard output: print "The answer is "; print 10 * 2; print ".\n"; #the output is: The answer is 20. followed by a newline

Database Development Process:

1) Requirement analysis 2) Data modeling 3) Database construction 4) Application development

Architecture of a Database System:

3-tiers: 1) Interface Tier (web interface, client-side) 2) Application Tier (app logic, database connection, server-side) 3) Database Tier (data storage, handling queries)

In a BLAST search, what type of P-value do you want?

A LOW P-value! Because that means a lesser chance that the results are due to chance.

TrEMBL

A computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.

SWISS-PROT

A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc), a minimal level of redundancy and a high level of integration with other databases.

Relational database

A foreign key (an attribute in one table that is the primary key of another table) is used to establish a link between the data in 2 tables

PAX-6 Genes

A master regulatory gene controlling a complex cascade of events in eye development - mutations cause aniridia, where the iris of eye is absent or deformed

Intrinsic gene-finding algorithms

Ab initio gene prediction is an intrinsic method based on gene content and signal detection; the genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content, statistical properties of the protein-coding sequence itself. Ab initio gene finding might be more accurately characterized as gene prediction, since extrinsic evidence is generally required to conclusively establish that a putative gene is functional... Ex. AUGUSTUS and FGENESH make predictions from the input genomic sequence alone

Genome annotation

Analysis of repetitive DNA, gene prediction, and gene functional annotation (sequence similarity analysis, protein domain analysis which use HMMs, Gene Ontology annotation)...

You have 2 distantly related proteins... which BLOSUM or PAM matrix is best suited to compare the 2 protein sequences?

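BLOSUM45 or PAM250 (a higher PAM number models more accumulated mutations, i.e. greater evolutionary distance, while BLOSUM numbering runs the opposite way: a lower BLOSUM number is built from more divergent sequences)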

If you want to align 2 protein sequences that are very similar, which BLOSUM or PAM matrix would be most appropriate?

BLOSUM90 or PAM30, because a high BLOSUM number means more similarity, and a low PAM (Point Accepted Mutation) number means fewer mutations and therefore more similarity.

BLAST

Basic Local Alignment Search Tool (BLAST) is a rapid, heuristic version of the Smith-Waterman algorithm, which starts database searching with query words.

BLAST

Basic Local Alignment Search Tool: an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. Most widely used method for sequence similarity analysis (can identify homologous genes that are evolutionarily related) - starts with a dotplot - PSI BLAST improves database search sensitivity (detection of true positives) and selectivity (rejection of false positives)

UCSC Table Browser

Can be used to query and retrieve genomic annotation data in text format

The Gene Ontology Hierarchies:

GO hierarchies include: Molecular function, biological process, and cellular component

Which of the following is NOT one of the Gene Ontology (GO) hierarchies? Molecular Function, Biological Process, Cellular Component, or Genetic Map

Genetic Map

NCBI Entrez

Good example of the database federation approach for data integration

HTML, XML, and NLP

HTML = HyperText Markup Language (used to create webpages) XML = EXtensible Markup Language (rules for encoding documents, used in info retrieval systems) NLP = Natural Language Processing, enables computers to derive meaning from natural language input

3 types of gene-finding algorithms

Intrinsic Extrinsic Combined

Protein Structure Prediction

Major problems in devising algorithms to accurately predict protein structures: -secondary structure prediction (predicting alpha helices and beta sheets) -fold recognition -homology modeling (predicting the 3D structure based on known structures of homologous proteins)

Gene Ontology

One of the main uses of the GO is to perform enrichment analysis on gene sets. For example, given a set of genes that are up-regulated under certain conditions, an enrichment analysis will find which GO terms are over-represented (or under-represented) using annotations for that gene set.

OMIM

Online Mendelian Inheritance in Man. An online catalog of human genes and genetic disorders. (Ex. can type Fragile X into search to learn about the disease)

Biological Networks

Physical Networks Logical Networks

PSI-BLAST

Position-Specific Iterated BLAST is particularly well suited for identification of distantly related proteins through constructing a Position-Specific Scoring Matrix (PSSM) to facilitate the search

iHOP system

Processes PubMed abstracts and generates a hyperlinked set of data to identify protein interactions

Structured Query Language (SQL)

SELECT Attribute(s) FROM Table(s) WHERE Condition(s) Ex. SELECT Amino_acid, Three_letter FROM AAtable, DGtable WHERE (AAtable.Distal_group = DGtable.Distal_group) AND (H_bond_donor = "Yes")
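
To see the SELECT / FROM / WHERE pattern run end to end, the example query can be tried against an in-memory SQLite database. The table and column names come from the card, but the column types and every row of data below are hypothetical, invented only so the query returns something.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DGtable (Distal_group TEXT PRIMARY KEY, H_bond_donor TEXT);
CREATE TABLE AAtable (Amino_acid TEXT, Three_letter TEXT, Distal_group TEXT);
-- Hypothetical rows for illustration only.
INSERT INTO DGtable VALUES ('hydroxyl', 'Yes'), ('methyl', 'No');
INSERT INTO AAtable VALUES ('Serine', 'Ser', 'hydroxyl'),
                           ('Threonine', 'Thr', 'hydroxyl'),
                           ('Alanine', 'Ala', 'methyl');
""")

# The first condition joins the two tables; the second restricts the rows.
rows = conn.execute("""
SELECT Amino_acid, Three_letter
FROM AAtable, DGtable
WHERE (AAtable.Distal_group = DGtable.Distal_group)
  AND (H_bond_donor = 'Yes')
ORDER BY Amino_acid
""").fetchall()

print(rows)
```

Standard SQL uses single quotes for string literals, which is why the sketch writes 'Yes' rather than "Yes".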

Scalar Variables in Perl

Scalar variable names consist of $ and an identifier - an identifier has to begin with a letter or underscore $name $Name $_n2 $A_very_Long_varable_name

UCSC Genome Browser

Supports both text-based query and sequence-based search (BLAT); organizes genomic annotations into tracks, each track provides a different type of data like CpG islands and SNPs, AND tracks allow users to submit their own data as custom annotation tracks, and to change the presentation of the available tracks.

Pfam

a database of common protein domains (multiple sequence alignments and HMMs); used in protein domain analysis

Database Schema

The logical structure or expression of the interrelationships among the data, Ex: STUDENT (SSN, Name, Major, Birthdate) ENROLL (SSN, CourseID, Year, Semester, Grade) COURSE (CourseID, CourseName, DeptNum) DEPARTMENT (DeptNum, DeptName, Office, Head) -DeptNum is the primary key for DEPARTMENT and the foreign key for the COURSE table
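
The DEPARTMENT / COURSE part of this schema can be sketched in SQLite to show what the foreign key buys you. The column types and the sample rows are assumptions; the helper name `try_insert_course` is hypothetical. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
CREATE TABLE DEPARTMENT (
    DeptNum  INTEGER PRIMARY KEY,
    DeptName TEXT, Office TEXT, Head TEXT
);
CREATE TABLE COURSE (
    CourseID   TEXT PRIMARY KEY,
    CourseName TEXT,
    DeptNum    INTEGER REFERENCES DEPARTMENT(DeptNum)  -- foreign key
);
-- Hypothetical sample rows.
INSERT INTO DEPARTMENT VALUES (1, 'Biology', 'Room 101', 'Dr. Example');
INSERT INTO COURSE VALUES ('BIO500', 'Bioinformatics', 1);
""")


def try_insert_course(course_id, name, dept_num):
    """Insert a course; return False if the foreign key check rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO COURSE VALUES (?, ?, ?)",
                         (course_id, name, dept_num))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert_course("BIO600", "Genomics", 1))    # department 1 exists
print(try_insert_course("CHEM100", "Chemistry", 99)) # no department 99

# The foreign key is also what links the two tables in a query.
linked = conn.execute("""
SELECT CourseName, DeptName
FROM COURSE, DEPARTMENT
WHERE COURSE.DeptNum = DEPARTMENT.DeptNum
ORDER BY CourseName
""").fetchall()
print(linked)
```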

A relational database

The primary key of a table is an attribute (or set of attributes) that has a unique value for each row in the table

Genome Annotation

The process of identifying genes, their regulatory sequences, their functions - relies heavily on bioinformatics

Theoretical bioinformatics

development of new algorithms and statistics to assess relationships among members of large data sets

Database Warehouse

first consolidates all the data from different sources into a local database and then utilizes the local database for fast queries; uses cleaning, loading, etc. to improve the quality of the data - it can be faster because you keep data all in one local database - EX. UniProt, UCSC Genome Browser, FlyBase

Gene Prediction

refers to the process of identifying the regions of genomic DNA that encode genes (programs include GENSCAN, FGENESH, AUGUSTUS, etc.)

chop()

the chop() function removes the last character of a string

#!/usr/bin/perl

tells the operating system where Perl is installed (most Perl statements end with a semicolon ; and comments run from a # sign to the end of the line)

Scalar Assignment

the = sign is the Perl assignment operator $fred = 8; #Give $fred value of 8 $barney = $fred + 2; #Give $barney value 10 $fred -= 5; #same as $fred = $fred - 5, so $fred is now 3

Extrinsic gene-finding algorithms

to collect extrinsic evidence for most or all of the genes in a complex organism requires the study of many hundreds or thousands of cell types, which presents further difficulties... the RefSeq database contains transcript and protein sequences from many different species, and the Ensembl system comprehensively maps this evidence to human and several other genomes (data presented as views, with each view showing a different level of detail). It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data, and there are new high-throughput sequencing technologies such as RNA-Seq and ChIP-sequencing... Major challenges in gene prediction include dealing with sequencing errors in raw DNA data, dependence on the quality of the sequence assembly, handling short reads, frameshift mutations, overlapping genes and incomplete genes. Extrinsic methods use additional information such as protein sequences, nucleotide (genomic) sequences, etc., using BLAST to find the intron/exon junctions.

