Bioinformatics
Transcriptional Gene elements
-35: RNA polymerase recognition site (important for sigma factor binding); -10: RNA polymerase binding site, the Pribnow box (the prokaryotic counterpart of the eukaryotic TATA box); +1: transcriptional start site; terminator
DELTA-BLAST
(Domain Enhanced Lookup Time Accelerated BLAST) -Related to position-specific iterated BLAST (PSI-BLAST) -Advantageous with shorter sequences -Searches pre-constructed matrices for comparison first (uses PSSMs that are already known and stored in a large database) -Those PSSMs come from multiple sequence alignments of conserved domains
WebLogo: How is a good and informational result assessed in this database?
- Multiple sequence alignment with a visual representation that allows patterns to be determined. Analysis: Base position: bases are arranged by frequency from top to bottom of the stack. Overall height of the stack: the most important positions have taller stacks (highly conserved = taller stack). Height of the letters within each stack: relative frequency of each base. Width of the stack: few gaps = larger width. Small sequences are fine when compared against enough other sequences of sufficient length, specifically the sections of interest.
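The stack heights described above can be sketched in a few lines. This is a minimal illustration, assuming the basic sequence-logo formula (information = log2(alphabet size) minus the column's Shannon entropy) and ignoring the small-sample correction that logo software may apply; the function names and the example alignment are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content (bits) of one alignment column, ignoring gaps.

    Basic formula only: log2(alphabet_size) - Shannon entropy.
    No small-sample correction is applied (an assumption of this sketch).
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_heights(alignment, alphabet_size=4):
    """Per column, map each letter to its height in bits (stack height x frequency)."""
    heights = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        total = len(residues)
        info = column_information(column, alphabet_size)
        counts = Counter(residues)
        heights.append({base: info * n / total for base, n in counts.items()})
    return heights


# Hypothetical -10 region alignment, used only as example input.
alignment = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
for position, stack in enumerate(logo_heights(alignment), start=1):
    print(position, {base: round(h, 2) for base, h in stack.items()})
```

A fully conserved DNA column reaches the maximum of 2 bits, while a column with all four bases at equal frequency has a height of 0.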
PSI-BLAST: How is a good and informational result assessed in this database?
- Once a single sequence from a highly conserved family (here, the DNA ligases) is used in constructing a profile, the rest of the family will almost certainly be retrieved (and have E-values of high significance) in subsequent iterations. Impressive E-values for sequences retrieved in later iterations depend upon the validity of earlier inferences and therefore should not be taken as automatic proof of homology.
What are the strengths vs. weaknesses of BLAST?
Strengths: a search method that permits speedy database searching. Weaknesses: not suited to determining distant homologs; your own protein can come up as a hit; not as helpful if only hypotheticals are matched. How it works: segments the query sequence into pieces ("words"); default word length is 3 amino acids or 11 nucleotides; creates a list of scores for comparing query words to target words; uses a scoring matrix to calculate scores for words that might be found in the database; saves the scores that exceed a given threshold T; scans the database for matches to the word list with acceptable T values.
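The word-splitting and database-scanning steps can be sketched as follows. This is a simplified illustration that only finds exact word matches; real BLAST also accepts "neighborhood" words whose substitution-matrix score exceeds the threshold T, which is left out here. The function names and sequences are hypothetical.

```python
def words(sequence, k=3):
    """Split a sequence into all overlapping words of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def seed_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a query word occurs in the subject.

    Exact matches only: the scoring-matrix / threshold-T step of real BLAST
    is deliberately omitted from this sketch.
    """
    index = {}
    for pos, word in enumerate(words(subject, k)):
        index.setdefault(word, []).append(pos)
    return [
        (qpos, spos)
        for qpos, word in enumerate(words(query, k))
        for spos in index.get(word, [])
    ]


print(words("MKVLA"))                  # word list for a toy protein query
print(seed_hits("MKVLA", "AAKVLAA"))   # seeds that would later be extended
```

Each seed is a starting point that the full algorithm would try to extend in both directions into a longer local alignment.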
What biological patterns have predictive power? Are some better than others?
A pattern is an assumption, but it can be an informative assumption. Examples: translational and transcriptional patterns, and sequences found in almost all signal-encoding regions. Some are better than others: the better ones are more constant, with less variation (e.g., the SD almost never matches its consensus). Protein patterns: shape follows from pattern, and amino acid sequence predicts function and folding, although these cannot be fully known without visualization. pI is suggestive of location within the cell. Domain and motif patterns are also predictive.
TMHMM: What are the strengths vs. weaknesses of this database?
Advantages: high accuracy (97-98% accurate on TM helices); only needs the FASTA sequence of the protein of interest; fast. Disadvantages: signal peptides contain long hydrophobic stretches that are often mistaken for transmembrane helices; porins (beta barrels) could be mistaken for TM helices; topological inversions; can confuse TM and non-TM proteins; can combine helices.
Rhodobacter capsulatus
Alphaproteobacteria; Gram-negative; purple nonsulfur bacterium; photosynthetic. Produces spheroidenone and hydroxyspheroidenone. These two products may or may not be produced in M. extorquens; this is currently unknown because one enzyme in the pathway has not yet been identified. This makes it a useful organism to compare M. extorquens against for decisions about CrtA. The genome encodes genes for photosynthesis, nitrogen fixation, utilization of xenobiotic organic substrates, and synthesis of polyhydroxyalkanoates. These features made it a favorite research tool for studying these processes. During photosynthetic growth, R. capsulatus exhibits several unique properties, including the formation of an intracytoplasmic membrane system as well as the synthesis of various metal-containing cofactors. These properties make R. capsulatus a promising expression host particularly suited for difficult-to-express proteins such as membrane proteins.
Ortholog vs. Paralog
Ortholog • Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes. Paralog • Paralogs are genes related by duplication within a genome. Orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions, even if these are related to the original one. Summary • Orthologs: diverged after speciation, tend to have similar function • Paralogs: diverged after gene duplication, some functional divergence occurs
Bradyrhizobium sp. ORS278
Pink or orange coloration. Alphaproteobacteria; Gram-negative tropical bacterium that is photosynthetic, fixes nitrogen, and forms symbiotic relationships with plants. Its carotenoid synthesis pathway is slightly more expansive, with more possible products. Produces spirilloxanthin, one of the pigments likely produced by M. extorquens. Photosynthetic Bradyrhizobium strains possess the unusual ability to form nitrogen-fixing nodules on a specific group of legumes in the absence of Nod factors.
What do e-values tell you? What is a "good" e-value?
The E-value is the number of alignments scoring at least as well as your query-to-"hit" alignment that would be expected to occur by chance in a database of that size (often described loosely as the odds that the pairing happened by chance). Values less than $10^{-3}$ are significant; values equal to or less than $10^{-15}$ can indicate a good match.
PAM
Point accepted mutation (also expanded as percent accepted mutation). A PAM matrix is a matrix where each column and row represents one of the twenty standard amino acids. In bioinformatics, PAM matrices are regularly used as substitution matrices to score sequence alignments for proteins. Each entry in a PAM matrix indicates the likelihood of the amino acid of that row being replaced with the amino acid of that column through a series of one or more point accepted mutations during a specified evolutionary interval, rather than these two amino acids being aligned due to chance. Different PAM matrices correspond to different lengths of time in the evolution of the protein sequence. Similarity scores were based on closely related proteins using a global alignment; empirically derived for close relatives.
DELTA-BLAST: What are the strengths vs. weaknesses of this database?
Positives: returns the most positive results and is more sensitive; DELTA-BLAST also detects homologous relationships in a larger number of SCOP superfamilies than do the other search programs. Negatives: takes a little more time; surprisingly, multiple iterations of DELTA-BLAST perform worse than a single one.
PSORT-b: What are the strengths vs. weaknesses of this database?
Positives: favors precision (or specificity) over recall (or sensitivity), so if you get a result there is a pretty good chance it is right; good specialization for Gram-negatives; quick and easy. Negatives: only gives definitive answers, even though a protein could have multiple localization sites; hypotheticals are probably not going to be found.
Shine-Dalgarno (SD)
Sequence upstream of a start codon that can be used to locate a protein-coding region. Ours was determined to be approximately GAGAGGA, with a 6-7 base gap between the SD and the ATG start. The Shine-Dalgarno (SD) sequence is a ribosomal binding site in prokaryotic mRNA, generally located around 8 bases upstream of the start codon AUG.[1] The RNA sequence helps recruit the ribosome to the mRNA to initiate protein synthesis by aligning the ribosome with the start codon. It helped to predict the position of the gene.
TMHMM: How is a good and informational result assessed in this database?
Sequences with transmembrane sections are revealed by coloration and described in the output.
BLAST
Basic Local Alignment Search Tool; pairwise alignment. Compares a sequence of choice to a database of many different sequences that have been identified (possibly without a known function, but known to exist) -Generates pairwise alignments only
Role in locations for protein function
1. Cytoplasm 2. Inner membrane 3. Periplasm 4. Outer membrane 5. Extracellular matrix
Understand the basic metabolic pathway that M. extorquens uses to make a living off methanol
M. extorquens uses aerobic respiration and gets both its energy and its carbon from methanol. One-carbon metabolism of M. extorquens: formaldehyde is produced from methanol in the periplasm of the cell and is transferred into the cytoplasm. Part of the formaldehyde is oxidized to CO2, and part is assimilated via the serine cycle. mxaF and mxaI are the genes encoding the large and small subunits of methanol dehydrogenase.
What is Brownian motion?
Brownian motion forces allow cells to change direction. Brownian motion or pedesis (from Greek: πήδησις /pɛ̌ːdɛːsis/ "leaping") is the random motion of particles suspended in a fluid (a liquid or a gas) resulting from their collision with the quick atoms or molecules in the gas or liquid.
M. extorquens is a facultative methylotroph. What does this mean?
Able to grow on multiple types of carbon source, not only one-carbon (C1) compounds such as methanol: C1, C2, C3, C4, C6?
What is curation?
Content curation is the process of sorting through the vast amounts of content on the web and presenting it in a meaningful and organized way around a specific theme. The work involves sifting, sorting, arranging, and publishing information. Data curation is a term used to indicate the management activities required to maintain research data long-term such that it is available for reuse and preservation. In science, data curation may indicate the process of extracting important information from scientific texts, such as research articles, by experts, to be converted into an electronic format, such as an entry in a biological database.
Bacterial Genome Patterns based on lifestyle, where is M. extorquens in this list?
1. Free living 2. Recent or facultative pathogen (symbiont that can associate with a host or be free living) 3. Obligate symbiont or pathogen (e.g., STDs have to be passed host to host and cannot be free living outside the host). M. extorquens is #2: it can exist with plants or in dirt.
Place the following elements in order (using left-to-right designation): Shine-Dalgarno; stop codon; 3' end; A/G (TSS transcription start site on DNA); AUG/GUG (start codon); 5' end
5'.....A/G (TSS on DNA).......SD......AUG (mRNA start)......[translated].....(mRNA stop).......3'
BLOSUM
Similarity scores are based on more "distantly" related proteins (grouped by percent identity) using local alignments; empirically derived for distant relatives. In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Henikoff and Henikoff.[1] They scanned the BLOCKS database for very conserved regions of protein families (that do not have gaps in the sequence alignment) and then counted the relative frequencies of amino acids and their substitution probabilities.
Curated
A genome sequence that has been organized by someone (a curator)
COBALT
Constraint-based Multiple Alignment Tool -Multiple alignment -Uses local alignments to make a multiple sequence alignment -Can be used to narrow down possible homology -Aligns the proteins that you enter, including the self protein
M. extorquens is an epi- and endophyte of plants. Why?
An endophyte is an endosymbiont, often a bacterium or fungus, that lives within a plant for at least part of its life without causing apparent disease. An epiphyte is a plant that grows harmlessly upon another plant (such as a tree) and derives its moisture and nutrients from the air, rain, and sometimes from debris accumulating around it instead of from the structure it is fastened to; an epiphytic bacterium likewise lives on the plant surface. Plant leaves produce methanol, which M. extorquens can then grow on.
Gene
A gene is the molecular unit of heredity of a living organism. It is used extensively by the scientific community as a name given to some stretches of deoxyribonucleic acids (DNA) and ribonucleic acids (RNA) that code for a polypeptide or for an RNA chain that has a function in the organism. The basic unit of heredity, composed of DNA. An ordered region of nucleotides located on a DNA molecule that encodes a specific functional product, either RNA or protein.
Why do plants excrete methanol?
Analysis of the emissions of volatile organic compounds from leaves has revealed that most plants emit methanol, especially during early stages of leaf expansion — it is probably produced as a by-product of pectin metabolism during cell wall synthesis, and a fraction of this pool is then emitted through stomata during transpiration.
PSI-BLAST: What are the strengths vs. weaknesses of this database?
Biggest problem with PSI-BLAST: FALSE POSITIVES, which corrupt the profile in later iterations. To stop the corruption: 1. Apply a filter (under Algorithm Parameters, Filters and Masking). 2. Lower the Expect value (default: E=0.005; lower: E=0.0001). 3. Visually inspect each PSI-BLAST hit with other informatic tools. Strengths: distant homologues may share limited sequence identity and may adopt the same three-dimensional structures, relationships that get lost in pairwise alignments of primary structure, so PSI-BLAST is useful for detecting weak but biologically meaningful relationships between proteins. PSI-BLAST can beat BLASTP if BLASTP finds some reliable alignments to database sequences (moderately distant matches are particularly useful). Then PSI-BLAST (which starts by running BLASTP) can determine which positions in the query sequence are conserved during evolution and devise an appropriate Position-Specific Scoring Matrix, which can be used to identify relatives at a further evolutionary distance. If the original BLASTP run cannot find any reliable alignment, PSI-BLAST is powerless. It constructs a multiple sequence alignment from BLASTP results.
SignalP: What are the strengths vs. weaknesses of this database?
Strengths: better understanding of TP, TN, FP, FN than previous versions. Weaknesses: lower sensitivity than previous versions.
SignalP: How is a good and informational result assessed in this database?
C-score: raw cleavage site score. S-score: signal peptide score. Y-score: combined cleavage site score. Mean S: average S-score of the possible signal peptide. D-score: discrimination score, a weighted average of the mean S and the max Y scores; it tells you whether there is a signal peptide. Other databases might report only S or C, but the combined score is what really allows a determination with this database: a value higher than the cutoff (.57?) indicates a signal peptide.
Why is M. extorquens interesting from an industrial stand point?
Carotenoid production may be utilized for various "industrial" uses. The unique ability of M. extorquens to metabolize methanol is attractive for possible industrial applications: it can be used to produce various products (e.g., butanol) from 1- and 2-carbon feedstocks, it has simple, inexpensive growth requirements, and it can produce useful proteins.
CDD
Conserved Domain Database -Identified putative motifs (active sites, binding, interaction sites, structure) -Curated set of information -Tells you about where the protein comes from, where it diverged, and tells about its family and subgroups -Hits give you domains of proteins, not proteins themselves -Tells you about your protein's function and whether it has the important sites that it needs -DUF (domain of unknown function): still useful to know the superfamily, because this tells you that your protein is conserved through evolution
Conserved hypothetical
Conserved proteins whose functions are still unknown: those that are found in organisms from several phylogenetic lineages but have not been functionally characterized. Hypothetical proteins often contain conserved domains, which can be compared with known family domains so that the hypothetical protein can be classified into a particular protein family even though it has not been investigated in vivo.
Coverage
Coverage is defined as the number of short reads that overlap each other within a specific genomic region. For example, a 30-fold coverage for the CYP2D6 gene means that every nucleotide within this gene region is represented in at least 30 distinct and overlapping short reads. Sufficient coverage is critical for accurate assembly of the genomic sequence. In short: reads per nucleotide or per distinct sequence of nucleotides.
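The "reads per nucleotide" idea can be sketched directly. This is a minimal illustration, assuming each read is given as a hypothetical (start, end) pair with 0-based, end-exclusive coordinates on the region; real pipelines compute depth from alignment files instead.

```python
def depth_per_position(region_length, reads):
    """Count how many reads cover each position of the region.

    reads: iterable of (start, end) pairs, 0-based and end-exclusive
    (a coordinate convention assumed for this sketch).
    """
    depth = [0] * region_length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, region_length)):
            depth[pos] += 1
    return depth


def mean_coverage(region_length, reads):
    """Average depth across the region (the 'N-fold coverage' figure)."""
    return sum(depth_per_position(region_length, reads)) / region_length


reads = [(0, 3), (2, 5)]  # two toy reads on a 5-base region
print(depth_per_position(5, reads))
print(mean_coverage(5, reads))
```

The per-position list shows where reads overlap; the minimum of that list is the figure that matters for a statement like "every nucleotide is represented in at least 30 reads".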
Domain vs. Motif
Domain: A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function. Motif: A short conserved region in a protein sequence. Motifs are frequently highly conserved parts of domains. Motifs are short sequences and domains are longer ones, so a domain can contain several motifs. Domains have a function within the protein. Another view of the difference: motifs are structural characteristics and domains are functional regions (not necessarily related to size). In a protein, a particular arrangement of amino acids or secondary structure that can be found in other proteins (not necessarily evolutionarily related) can be called a motif. If that particular arrangement is related to some function (DNA or protein binding, catalytic, etc.), then it is a domain.
How a bit score is used to calculate an e-value:
E-value (expect value): calculated from the bit score (the lower the E-value, the better). $E = m \, n \cdot 2^{-S'}$, where m = length of the query sequence, n = size of the database, and S' = bit score.
COBALT: How is a good and informational result assessed in this database?
For scoring multiple alignments, we can use this formulation by checking how each sequence S scores against the alignment for the rest of the sequences. Because c_ij includes the count for S in the multiple alignment for S, the count for c_ij should be decremented by one while scoring S. Also, columns with gaps will not have ∑_j c_ij = N. COBALT is a flexible tool for simultaneously aligning a given set of protein sequences, where users can directly specify pairwise constraints and/or ask COBALT to generate the constraints using sequence similarity, (optional) CDD searches and (optional) PROSITE pattern searches. COBALT will optionally create partial profiles for input sequences based on any CDD search results. Two alignments are said to overlap in this context if their ranges on S_i overlap.
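As a worked illustration of the formula above, here is a minimal sketch that computes an E-value from a bit score. The function name and the example lengths are hypothetical; real search programs also apply corrections (such as effective lengths) that are not modeled here.

```python
def e_value(bit_score, query_length, database_length):
    """E = m * n * 2^(-S'), with m = query length, n = database size, S' = bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Toy numbers chosen so the arithmetic is easy to follow: 4 * 256 = 2^10.
print(e_value(10, 4, 256))   # search space exactly cancels a 10-bit score
print(e_value(20, 4, 256))   # ten more bits shrink E by a factor of 2^10
```

Each additional bit of score halves the E-value, which is why bit scores from searches against different databases can be compared directly while E-values cannot.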
Flagellar structure in gram-negatives.
Function as an outboard motor by rotating, literally a rotor. Not a whip.
G+C content
GC % can help determine whether a protein will be produced. In molecular biology and genetics, GC-content (or guanine-cytosine content) is the percentage of nitrogenous bases on a DNA molecule that are either guanine or cytosine. Higher GC content gives a higher melting temperature. This organism has a high GC%.
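The definition above translates into a one-line calculation. A minimal sketch, assuming ambiguous characters (such as N) are simply ignored; the function name is hypothetical.

```python
def gc_content(sequence):
    """Percentage of A/C/G/T bases that are G or C; other characters are ignored."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


print(gc_content("GGCCAATT"))
print(gc_content("gcgcNNat"))
```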
Proteobacteria
Gamma, Beta, Alpha, Epsilon, and Delta are classes of Proteobacteria. Alphaproteobacteria grow at very low levels of nutrients and have unusual morphology such as stalks and buds. Betaproteobacteria often use nutrient substances that diffuse away from areas of anaerobic decomposition of organic matter (hydrogen gas, ammonia, methane) and include chemoautotrophs. Gammaproteobacteria are the largest subgroup. Deltaproteobacteria include bacteria that are predators on other bacteria and are important contributors to the sulfur cycle. Epsilonproteobacteria are slender Gram-negative rods that are helical or curved.
How are genomes initially annotated and curated?
Genes are identified, function is predicted, metabolic reconstructions are developed and tied to specific genes, insertions, prophages, etc. are labeled, frameshifts and pseudogenes are predicted, and regulatory sites and operons are identified. Initial annotation gives only location and predicted function; a genome is curated by updating it with new information as it is added. Curation of genomes happens seldom. Much initial annotation comes either from computer generation with new genome sequences or from expert analysis, which can lead to curation.
T-COFFEE: What are the strengths vs. weaknesses of this database?
Strengths: improved accuracy; fewer errors in the earlier stages. Weaknesses: slower than COBALT and other alignment options; must know which sequences to use; "greedy progressive method".
Global alignment vs. Local alignment
Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot end in gaps.) Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The very basic difference is that in a local alignment you try to match your query with a substring (a portion) of your subject (reference), whereas in a global alignment you perform an end-to-end alignment with the subject (and therefore you may end up with a lot of gaps in a global alignment if the sizes of query and subject are dissimilar). Global alignment takes the entirety of both sequences into consideration when finding alignments, whereas local alignment may take only a small portion into account. Global: whole sequence of both proteins is used. Local: finds areas of greatest similarity between two proteins, usually not the entire length.
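The contrast can be made concrete with a small dynamic-programming sketch that scores the same pair of sequences both ways. It uses the classic Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a toy scoring scheme (match +1, mismatch -1, gap -1) chosen only for illustration; these are not the parameters of any particular tool, and the function name is hypothetical.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a vs b.

    local=False: global (Needleman-Wunsch), both sequences aligned end to end.
    local=True:  local (Smith-Waterman), best-scoring pair of substrings.
    Linear gap penalty; toy scoring values assumed for this sketch.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


short, long_ = "GATTACA", "CCCGATTACACCC"
print(alignment_score(short, long_, local=False))  # penalized for the length difference
print(alignment_score(short, long_, local=True))   # finds the embedded match
```

With sequences of very different lengths, the global score is dragged down by the gaps needed to span the longer sequence, while the local score reflects only the shared region.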
% Identity vs. % Similarity
Identity refers to an exact match between two nucleotides or amino acids; similarity refers to a resemblance between two residues that is greater than one would expect at random. Identity: extent to which residues in aligned sequences are invariant. Similarity: extent to which residues in aligned sequences have similar properties; the sequences need not have diverged from a common ancestor. Identity is the degree of correlation between 2 un-gapped sequences, and indicates that the amino acids or nucleotides at a particular position are an exact match. Generally, an identity of 25% or higher suggests the potential for similarity of function; an identity of 18-25% implies similarity of structure or function. It is important to note that 2 or more completely unrelated sequences can have 20% identity or greater, so this is not a hard and fast rule. Similarity is the degree of resemblance between two sequences when they are compared, and indicates that the amino acids or nucleotides at a particular position have some properties in common (for instance, charge or hydrophobicity), but are not identical. A high percentage of similar residues can also suggest a conserved function or structure. In short: Identity: same amino acid at the same position. Similarity: amino acids that have similar characteristics (charge, size, etc.) at the same position.
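The two percentages can be computed from a pairwise alignment as sketched below. The property groups are a toy grouping assumed for illustration (real tools count a pair as "similar" when its substitution-matrix score is positive), and this sketch divides by ungapped columns only, which is one of several conventions. All names are hypothetical.

```python
# Toy property groups (an assumption of this sketch, not a standard table).
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def identity_and_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) over ungapped columns of two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gapped columns are skipped in this sketch
        compared += 1
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared


print(identity_and_similarity("MKLV", "MRIV"))
```

In the example, K/R and L/I are not identical but fall in the same property group, so similarity is higher than identity.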
INSDC
International Nucleotide Sequence Database Collaboration: a joint effort to collect and disseminate databases containing DNA and RNA sequences.[1] It involves the following computerized databases: DNA Data Bank of Japan (Japan), GenBank (USA) and the European Nucleotide Archive (UK). New and updated data on nucleotide sequences contributed by research teams to each of the three databases are synchronized on a daily basis through continuous interaction between the staff at each of the collaborating organizations.
Operon
In genetics, an operon is a functioning unit of genomic DNA containing a cluster of genes under the control of a single promoter.[1] A genetic regulatory system found in bacteria and their viruses in which genes coding for functionally related proteins are clustered along the DNA.
Why is doing this work important for M. extorquens? For other species?
M. extorquens is a relatively new bacterium with little work done on many of its functions or genes, aside from the fact that it is sequenced. It is valuable, then, to further understand the strains as well as to predict functions without doing extensive wet lab research. It is also valuable because of some of the possible applications, as M. extorquens is unique in its ability to grow off of methanol, suggesting possible industrial uses, and because of its symbiotic relationship with plants and its possible chemotaxis. This work is important in other species for the same general reason: with newer organisms it allows us to gather information without extensive hands-on research, and instead to understand new organisms through prediction and dry lab work by comparison with what is already known.
Meaning of the name/alternative names
Methyl = methane; extorquens = to twist out/wrench out. Alternative names: Bacillus extorquens, Vibrio extorquens, Pseudomonas extorquens, Flavobacterium extorquens, Protomonas extorquens
What are MCPs? How do the cells adapt?
Methyl-accepting chemotaxis protein (MCP) is a transmembrane sensor protein of bacteria. MCPs allow bacteria to detect concentrations of molecules in the extracellular matrix so that they can swim smoothly or tumble accordingly. If the bacterium detects rising levels of attractants (nutrients) or declining levels of repellents (toxins), it continues swimming forward (smooth swimming). If it detects declining levels of attractants or rising levels of repellents, it tumbles and re-orients itself in a new direction. In this manner, a bacterium may swim towards nutrients and away from toxins.[2] CheY-phosphate is the tumble generator, not the smooth-swimming signal seen in chemotaxis; to swim smoothly, CheY must be dephosphorylated, which means CheA must not be phosphorylated. CheR puts on the methyl group and CheB can remove it (after about three minutes). MCPs are located in the inner (cytoplasmic) membrane; since they sit there, for at least some signals there must be a protein allowing them to pass through the outer membrane. The signaling domain must interact with CheA and CheW (CheW helps with phosphorylation of CheA?). An attractant blocks CheA phosphorylation, so CheY is not phosphorylated and we see a smooth run in response to the attractant. The MCP is methylated for memory: it stays methylated for about three minutes (example?), which helps to hold the conformational change.
What is the biased random walk?
Movement is random: the bacterium does not know the gradient or where there is attractant (sugar, amino acids...), but once it encounters some it stops the random go, stop, reverse behavior and swims smoothly straight for a while. Bacteria stop moving forward (tumble) when the flagella fly apart. This biased random walk is a result of simply choosing between two methods of random movement, namely tumbling and straight swimming.[4] In fact, chemotactic responses such as forgetting direction and choosing movements resemble the decision-making abilities of higher life-forms with brains that process sensory data.
T-COFFEE: How is a good and informational result assessed in this database?
Multiple sequence alignment; coloration (blue/green to red) suggests how conserved a residue is, red being "red hot". Uses both local and global pairwise alignment.
WebLogo:
Multiple alignment -Graphical display of a multiple sequence alignment -Gets you a general consensus sequence, allows you to see the dominant residues in all positions and the frequencies of those residues
NCBI
National Center for Biotechnology Information. The NCBI houses a series of databases relevant to biotechnology and biomedicine. Major databases include GenBank for DNA sequences and PubMed for biomedical literature.
What is the "vocabulary problem" in annotation?
No established organization is involved in the standardization of protein names, nor are there any efforts that are valid across species. Nomenclature is important for communication, literature searching, and entry retrieval. Consistency has been attempted, but there are disagreements and subtle differences, which can cause bioinformatic analysis to be skewed or affected by different names for the same thing.
ORF vs. CDS
ORFs: The region of the nucleotide sequence from the start codon (ATG) to the stop codon is called the open reading frame. Gene finding in organisms, especially prokaryotes, starts from searching for open reading frames (ORFs). An ORF is a sequence of DNA that starts with the start codon "ATG" (not always) and ends with any of the three termination codons (TAA, TAG, TGA). The coding sequence (CDS) is the actual region of DNA that is translated to form proteins. While the ORF may contain introns as well, the CDS refers to those nucleotides (concatenated exons) that can be divided into codons which are actually translated into amino acids by the ribosomal translation machinery. ORF = open reading frame, a DNA sequence with the potential to encode a protein; it begins with a start codon and ends with a stop codon. CDS = coding sequence, an open reading frame that is actually expressed and encodes a functional product. Mainly: CDS means only that the sequence is known to be transcribed and, therefore, it is coding for something; neither gene nor protein has to be known. Any full mRNA sequence (obtained from cDNA sequencing) will have a full coding sequence. An ORF is usually predicted based on DNA sequence and not proven to be transcribed.
Type II methylotroph for carbon assimilation
Type II methanotrophs are part of the Alphaproteobacteria and utilize the serine pathway of carbon assimilation.
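The ORF definition above (start codon, in-frame codons, stop codon) can be sketched as a simple scan. This is a minimal illustration that assumes ATG as the only start codon and searches the three forward-strand frames only; real gene finders also handle alternative starts, the reverse strand, and minimum-length thresholds tuned to the genome. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ORFs on the forward strand.

    start is the index of the A in ATG; end is exclusive and includes the stop codon.
    min_codons counts every codon in the ORF, start and stop included.
    Assumes ATG is the only start codon (a simplification).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


sequence = "CCATGAAATAGCC"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

Whether a predicted ORF is also a CDS cannot be decided by this scan; that requires evidence of expression, which is the distinction the card draws.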
WebLogo: What are the strengths vs. weaknesses of this database?
Strengths: gives more information than consensus sequences; provides good graphic representation; helps identify patterns. Weaknesses: less useful with limited sequences (<20 nucleotide/40 protein); sequences of identical length required; relative sizing can make visualization more difficult.
PSSM
A position weight matrix (PWM), also called a position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences. PWMs are often derived from a set of aligned sequences that are thought to be functionally related and have become an important part of many software tools for computational motif discovery. Each coefficient in the underlying count matrix indicates the number of times that a given nucleotide has been observed at a given position. A Position-Specific Scoring Matrix is a type of scoring matrix used in protein BLAST searches in which amino acid substitution scores are given separately for each position in a protein multiple sequence alignment. Positive scores indicate that the given amino acid substitution occurs more frequently in the alignment than expected by chance, while negative scores indicate that the substitution occurs less frequently than expected. Large positive scores often indicate critical functional residues, which may be active site residues or residues required for other intermolecular interactions.
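How counts turn into position-specific scores can be sketched as follows. This is a minimal illustration, assuming an ungapped DNA alignment, a uniform background frequency, and a pseudocount of 1 per letter; PSI-BLAST's actual PSSM construction uses sequence weighting and different pseudocount schemes. Function names and the alignment are hypothetical.

```python
import math


def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds matrix: one dict of letter -> score per alignment column.

    score = log2(observed frequency / background frequency), with a uniform
    background and simple pseudocounts (assumptions of this sketch).
    """
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for letter in alphabet:
            frequency = (column.count(letter) + pseudocount) / total
            scores[letter] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of a sequence the same length as the PSSM."""
    return sum(column[letter] for column, letter in zip(pssm, sequence))


alignment = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]  # toy -10 box examples
pssm = build_pssm(alignment)
print(round(score_sequence(pssm, "TATAAT"), 2))  # consensus-like sequence
print(round(score_sequence(pssm, "GCGCGC"), 2))  # unrelated sequence
```

Letters seen more often than the background get positive scores and letters seen less often get negative scores, matching the interpretation given in the card.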
PSI-BLAST
Position-Specific Iterated BLAST -Multiple sequence alignment -Differs from DELTA-BLAST in that it builds its multiple sequence alignment from the search results rather than from conserved domains that are already made (we generate the PSSM)
SignalP
Predicts the presence and location of signal peptide cleavage sites in amino acid sequences. A signal peptide (sometimes referred to as signal sequence, targeting signal, localization signal, localization sequence, transit peptide, leader sequence or leader peptide) is a short (5-30 amino acids long) peptide present at the N-terminus of the majority of newly synthesized proteins that are destined towards the secretory pathway.[1] The cleavage site often comes after a hydrophobic sequence ending in an A (alanine), with polar residues after it.
Rhodomicrobium vannielii
Produces a red pigment; closely related biochemically to the nonsulfur purple bacteria. Has been shown to produce both lycopene and spirilloxanthin, two of the products that can be produced by M. extorquens. Gram-negative, phototrophic alphaproteobacterium with a unique and complex cycle of development, characterized by three different cell types: budding cells, swarmer cells, and exospores. Enzymes present in the terpenoid pathway appear identical (in identity) to those of M. extorquens. More expansive carotenoid pathway, with more carotenoid products able to be produced. Rhodomicrobium vannielii is distinguished by its unique mode of reproduction and its internal system of membranes. Although membranous systems are, under certain conditions, to be found in other photosynthetic bacteria, they are hardly comparable to the well developed membrane system which is a constant feature of the cells of R. vannielii.
What are three possible outcomes of an evolved protein (through various mutation types)?
Proteins change through: 1. Frequent amino acid substitutions 2. Rare amino acid substitutions 3. Gaps (stop codon, deletion, insertion, frameshift). Original gene: 1. Gains a new, but similar function 2. Loses function (genetic drift, pseudogene). Duplicated gene: 1. Retains original function 2. Gains a new, but similar function 3. Loses function (genetic drift, pseudogene).
Pseudogene vs. Hypothetical
Pseudogenes are genomic DNA sequences similar to normal genes but non-functional; they are regarded as defunct relatives of functional genes. A pseudogene is a region of DNA with homology to known genes that has lost its ability to encode a functional product. In biochemistry, a hypothetical protein is a protein whose existence has been predicted, but for which there is no experimental evidence that it is expressed in vivo. These are genes that have not been experimentally characterized and whose functions cannot be deduced from simple sequence comparisons alone.
PSORT-b : -How is a good and informational result assessed in this database?
A rating of 7.5 is considered a good cutoff, above which a single localization can be assigned. Score between 4.5 and 7.49 (Gram -): the result comes back as unknown, but probably because of multiple localization points. Use your own predictions.
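The cutoffs in this card can be written as a small decision rule. This is only a sketch of how one might read a table of scores by hand: the function name `interpret_localization`, the wording of the returned labels, and the example scores are hypothetical, and only the 7.5 and 4.5 thresholds come from the note above.

```python
def interpret_localization(scores, single_site_cutoff=7.5, ambiguous_floor=4.5):
    """scores: dict mapping a localization site to its score (0-10)."""
    site, best = max(scores.items(), key=lambda item: item[1])
    if best >= single_site_cutoff:
        # Confident enough to assign a single localization.
        return site
    if best >= ambiguous_floor:
        # Reported as unknown, but worth checking for several sites.
        return "Unknown (may have multiple localization sites)"
    return "Unknown"

# Made-up example scores for a Gram-negative protein.
example = {"Cytoplasmic": 0.2, "CytoplasmicMembrane": 9.5,
           "Periplasmic": 0.1, "OuterMembrane": 0.1, "Extracellular": 0.1}
print(interpret_localization(example))
```

For the middle band the rule deliberately does not pick a site, since the card says to fall back on your own predictions there.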
How to calculate a bit score:
Raw score (S): calculated by counting the number of identities, mismatches, gaps and "-" characters (using the scoring matrices). S = aI + bX - cO - dG, where I = number of identities (a is the reward), X = number of mismatches (b is the mismatch score, which is normally negative), O = number of gaps (c is the penalty per gap), G = total number of "-" characters (d is the penalty per character). Bit score (S'): a measure of statistical significance that normalizes raw S scores, which gives the ability to compare S scores from different databases. The bit score S' is a normalized score expressed in bits that lets you estimate the magnitude of the search space you would have to look through before you would expect to find a score as good as or better than this one by chance.
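A worked example may help here. The sketch below computes a raw score from the card's formula and then normalizes it with the standard Karlin-Altschul relation S' = (lambda * S - ln K) / ln 2. The function names and the specific values of a, b, c, d, lambda and K are illustrative assumptions, not the defaults of any particular BLAST program.

```python
import math

def raw_score(identities, mismatches, gaps, gap_chars, a=1, b=-3, c=5, d=2):
    """S = aI + bX - cO - dG, with b given as a negative mismatch score."""
    return a * identities + b * mismatches - c * gaps - d * gap_chars

def bit_score(raw, lam, k):
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)

# 90 identities, 8 mismatches, 1 gap made of 2 "-" characters.
s = raw_score(90, 8, 1, 2)           # 90 - 24 - 5 - 4
s_bits = bit_score(s, lam=0.267, k=0.041)  # lambda and K are placeholders
print(s, round(s_bits, 1))
```

Because lambda and K depend on the scoring system, two searches with different matrices can give different raw scores for equally surprising alignments, while their bit scores stay comparable.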
RefSeq database in NCBI
The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences (DNA, RNA) and their protein products. This database is built by the National Center for Biotechnology Information (NCBI) and, unlike GenBank, provides only a single record for each natural biological molecule. • Curated • NCBI created • Changed as new data emerges • Single records for each molecule from major organisms • Limited to model organisms • Exclusive to NCBI • Akin to "review articles" • Unique identification number for a complete RefSeq sequence record • Format: two letters followed by an underscore and six digits (NT_123456) • The first two letters of the RefSeq accession number indicate the type of sequence included in the record
CDD: -What are the strengths vs. weaknesses of this database?
Advantages: residue conservation and divergence; inferred functional properties; anything is better than nothing. Limitations: short, sequence-diverse domains (PSSM); coverage vs. curation. The Conserved Domain Database helps find motifs: active sites, chemical binding sites, protein-protein interaction sites, and structural motifs.
Sigma factor
Sigma (σ) factors control the promoter selectivity of bacterial RNA polymerase (RNAP). On binding to RNAP, σ factors allow efficient promoter recognition and transcription initiation; they work with RNA polymerase to start transcription. A sigma factor (σ factor) is a protein needed only for initiation of RNA synthesis. It is a bacterial transcription initiation factor that enables specific binding of RNA polymerase to gene promoters.
COBALT -What are the strengths vs. weaknesses of this database?
Multiple sequence alignment: compares two or more sequences and looks for similarity/identity. Strengths: can align two or more proteins that you want to look at; does not give you hundreds of proteins that may be similar; can narrow down which potential homolog is the most likely. Weaknesses: need to know what your protein does and its possible homologs; does not show similarity.
PSI-BLAST How does this database allow you to find organisms more distantly related with similar function
The PSSM captures the conservation pattern in the alignment and stores it as a matrix of scores for each position in the alignment: highly conserved positions receive high scores and weakly conserved positions receive scores near zero. This profile is used in place of the original substitution matrix for a further search of the database to detect sequences that match the conservation pattern specified by the PSSM. The newly detected sequences from this second round of the search that are above the specified score (E-value) threshold are again added to the alignment, and the profile is refined for another round of searching. This process is continued iteratively until desired or until convergence, i.e., the state where no new sequences are detected above the defined threshold. 1. Performs a normal blastp 2. PSI-BLAST constructs a multiple sequence alignment from the blastp results (position-specific scoring matrix, PSSM) 3. Uses the PSSM as the query 4. PSI-BLAST estimates statistical significance 5. Continues iteratively until no new hits are found; each run shows new alignments that were below the threshold on the previous run
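The search, rebuild, search-again loop can be sketched with a deliberately tiny stand-in for the real thing. Everything here is an assumption for illustration: sequences are ungapped and of equal length, the "profile" is just per-column residue frequencies rather than a true PSSM, the score is the average frequency of the candidate's residues, and the 0.5 threshold replaces an E-value cutoff. It is not the PSI-BLAST algorithm, only the shape of its iteration.

```python
def build_profile(seqs):
    """Per-column residue frequencies from the current set of hits."""
    columns = []
    for i in range(len(seqs[0])):
        counts = {}
        for seq in seqs:
            counts[seq[i]] = counts.get(seq[i], 0) + 1
        columns.append({res: n / len(seqs) for res, n in counts.items()})
    return columns

def profile_score(columns, seq):
    """Average frequency of the candidate's residue in each column."""
    return sum(col.get(res, 0.0) for col, res in zip(columns, seq)) / len(columns)

def iterate_search(query, database, threshold=0.5, max_rounds=10):
    hits = [query]
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        columns = build_profile(hits)          # rebuild profile from all hits
        new = [s for s in database
               if s not in hits and profile_score(columns, s) >= threshold]
        if not new:                            # convergence: nothing new found
            break
        hits.extend(new)
    return hits, rounds

# Toy database: one close relative, one distant relative, one unrelated.
database = ["AAAAAA", "AAAACC", "AACCCC", "GGGGGG"]
hits, rounds = iterate_search("AAAAAA", database)
print(hits, rounds)
```

In this toy run the distant sequence scores below the threshold against the query alone and is only picked up after the closer relative has been folded into the profile. That is also why later-round hits depend on the earlier inferences being valid.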
How is COBALT different from T-COFFEE
COBALT uses local pairwise similarity that is present in multiple sequence pairs to highlight similar regions in otherwise divergent sequences. Local alignments can also constrain global alignment to improve performance, because the presence of a constraint reduces the size of the space that a dynamic programming implementation must search for an optimal pairwise alignment. Some algorithms, such as T-Coffee (Notredame et al., 2000) and DbClustal (Thompson et al., 2000), do use libraries of pairwise alignments, but they do not attempt to explicitly choose alignments present in multiple pairs.
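BLAST typically uses BLOSUM62. What does this mean?
BLOSUM62 is built from alignment blocks clustered at 62% identity. A lower BLOSUM number (or a higher PAM number) is suited to a decreased degree of sequence identity between the query and its target sequences, i.e., increased divergence. The converse holds as well: a higher BLOSUM number (or a lower PAM number) is suited to closely related sequences.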
Why are annotations predictions and not proof of function?
The annotation of new sequences is mostly by prediction through computational methods; we can't truly know without experimentation. Determining that a sequence is functional should be distinguished from determining the function of the gene or its product. Predicting the function of a gene and confirming that the gene prediction is accurate still demands in vivo experimentation through gene knockout and other assays, although frontiers of bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.
TMHMM
Transmembrane Hidden Markov Model (TMHMM) -Looks at the secondary structure of a protein and determines a possible transmembrane function -Functions: transporters, anchors, enzymes, and receptors -Parameters: Hydrophobicity, Charge, Helix Length, Grammatical Constraints -Hydrophobicity - The inside of the lipid bilayer holds hydrophobic amino acid residues -Charge - Amino acid residues with charged R groups are on the cytoplasmic or extracellular side of the lipid bilayer. Positive charges are on the cytoplasmic side -Helix Length - The lipid bilayer has a fixed thickness of two lipids -Grammatical Constraints - Cytoplasmic and non-cytoplasmic loops have to alternate
T-Coffee
Tree-based Consistency Objective Function For alignment Evaluation -Multiple sequence alignment -Local and global alignments (ClustalW and Lalign) -Uses phylogeny, local alignment and global alignment to make the multiple sequence alignment
STOP Codons
UGA (U Go Away), UAA (U Are Away), UAG (U Are Gone)
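As a quick self-check on the three codons, here is a minimal sketch that scans a sequence codon by codon and reports where the first in-frame stop falls. The function name `first_stop` and the example sequences are made up for illustration; DNA input is converted to RNA by replacing T with U.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def first_stop(seq, frame=0):
    """Return the 0-based index of the first in-frame stop codon, or -1."""
    rna = seq.upper().replace("T", "U")
    # Step through the sequence three bases at a time in the chosen frame.
    for i in range(frame, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            return i
    return -1

print(first_stop("AUGGCCUAAGGG"))
```

A stop codon that is out of frame is ignored, so the same sequence can give different answers for `frame=0`, `1`, and `2`.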
Accession number (ACCN):
An accession number is a unique identifier assigned to a particular genome or protein sequence record so that it can be identified in a database and tracked across different databases. It is assigned consecutively to the entire sequence record when the record is submitted to GenBank. • The GenBank accession number is a combination of letters and numbers, usually one letter followed by five digits (e.g., M12345, old format) or two letters followed by six digits (e.g., AC123456, new format). • Will not change even if the author submits a request to change some of the information in the record. • A unique identifier for a complete sequence record, while a gi is an identification number assigned just to the sequence data. • The NCBI Entrez System is searchable by accession number using the Accession [ACCN] search field.
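The two formats described above are easy to check with a regular expression. This is a sketch that covers only the two patterns named in this card (one letter plus five digits, two letters plus six digits); the names `GENBANK_ACCESSION` and `looks_like_genbank_accession` are hypothetical, and other accession styles or version suffixes are intentionally not handled.

```python
import re

# One letter + five digits (older) or two letters + six digits (newer).
GENBANK_ACCESSION = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")

def looks_like_genbank_accession(text):
    """True if text matches one of the two formats described in the card."""
    return bool(GENBANK_ACCESSION.match(text.strip().upper()))

for candidate in ["M12345", "AC123456", "NT_123456", "12345"]:
    print(candidate, looks_like_genbank_accession(candidate))
```

The RefSeq-style `NT_123456` fails on purpose, because the underscore marks it as a different kind of identifier.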
How is a good and informational result assessed in this database? DELTA-BLAST
We annotated the test set and database sequences by using RPS-BLAST to compare them to CDD version 2.30. An E-value ≤ 0.01 yielded an association with a conserved domain (CD).
What adjustments can be made to blast searches?
What will these adjustments do for your search? Database: which set of sequences is searched. Organism: include or exclude organisms. Algorithm: e.g., PSI-BLAST, PHI-BLAST. Max target sequences: how many hits are reported. Expect threshold: the expected number of chance matches in a random model. Word size: the size of the initial match. Matrix: BLOSUM or PAM. Gap costs: how much a gap costs in the scoring equation. Increasing the gap costs will result in alignments with fewer gaps introduced.
CDD: -How is a good and informational result assessed in this database?
A domain match.
Substitution matrix
A substitution matrix describes the rate at which one character in a sequence changes to other character states over time. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where the similarity between sequences depends on their divergence time and the substitution rates as represented in the matrix. The scores serve as measures of similarity.
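To show how such a matrix is used as a measure of similarity, here is a minimal sketch that scores an ungapped DNA alignment. The matrix values (+2 for a match, -1 for a transition, -2 for a transversion) are made up for illustration and are not taken from BLOSUM, PAM, or any published matrix; the names `MATRIX` and `score_ungapped` are likewise hypothetical.

```python
PURINES = {"A", "G"}

def _toy_score(x, y):
    if x == y:
        return 2
    # Transition: purine<->purine or pyrimidine<->pyrimidine (more common).
    same_class = (x in PURINES) == (y in PURINES)
    return -1 if same_class else -2

# The substitution matrix as a lookup table keyed by residue pair.
MATRIX = {(x, y): _toy_score(x, y) for x in "ACGT" for y in "ACGT"}

def score_ungapped(seq_a, seq_b, matrix=MATRIX):
    """Sum the matrix entries over each aligned pair of residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped scoring needs equal-length sequences")
    return sum(matrix[(x, y)] for x, y in zip(seq_a, seq_b))

print(score_ungapped("ACGT", "ACGT"),
      score_ungapped("ACGT", "GCGT"),
      score_ungapped("ACGT", "CCGT"))
```

Penalizing the rarer change more heavily is the same idea that protein matrices apply to conservative versus radical amino acid substitutions.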
gram-negative vs. gram-positive
Gram-positive bacteria have a thick layer of peptidoglycan, no outer membrane, and a low lipid content in the wall. A Gram-positive bacterium has a thick, multilayered cell wall consisting mainly of peptidoglycan (150 to 500 Å) surrounding the cytoplasmic membrane. Gram-negative bacteria have a thin peptidoglycan layer, a periplasmic space, and an outer membrane. The Gram-negative cell wall contains two layers external to the cytoplasmic membrane. Immediately external to the cytoplasmic membrane is a thin peptidoglycan layer, which accounts for only 5% to 10% of the Gram-negative cell wall by weight. There are no teichoic or lipoteichoic acids in the Gram-negative cell wall. External to the peptidoglycan layer is the outer membrane, which is unique to Gram-negative bacteria. The area between the external surface of the cytoplasmic membrane and the internal surface of the outer membrane is referred to as the periplasmic space. This space is actually a compartment containing a variety of hydrolytic enzymes. These layers play a role as locations for protein function.
Why do we annotate proteins and not genes?
Because not all genes encode something. It would be futile to work on genes like that, so we only annotate functional things: proteins are functional, whereas genes can be pseudogenes.
DELTA-BLAST: -How is DELTA-BLAST different from PSI-BLAST?
Both are intended for detection of distant protein homologs. The difference is that PSI-BLAST uses the results of a first blastp iteration to construct a PSSM and then uses it to search the sequence database. DELTA-BLAST uses a PSSM derived from the CDD database, so the initial PSSM construction is much quicker than in PSI-BLAST.
Understand the basic model for chemotaxis in E. coli. CW vs. CCW flagellar rotation.
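CCW rotation: the flagella are bundled and the bacterium moves forward. CW rotation: the flagella fly apart, Brownian motion switches the direction, and when CCW rotation starts again the bacterium continues moving in the new direction.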
Why annotate?
Compiling known information to predict the function of an unknown gene. In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. This process needs to be automated because most genomes are too large to annotate by hand, not to mention the desire to annotate as many genomes as possible, as the rate of sequencing has ceased to pose a bottleneck. Annotation is made possible by the fact that genes have recognisable start and stop regions, although the exact sequence found in these regions can vary between genes. DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. An annotation (irrespective of the context) is a note added by way of explanation or commentary. • Once a genome is sequenced, all of the sequence data must be analyzed to understand what it means. • Critical to annotation is the identification of the genes in a genome, the structure of the genes, and the proteins they encode. • Once a genome is annotated, further work is done to understand how all the annotated regions interact with each other.
START Codon
extorquens: AUG or GUG. There can be overlap of the stop and start codons of different genes.
Shine Dalgarno Sequence
extorquens: GAGAGGAG
How is a good and informational result assessed in blast
A low E-value. Take another look at the subject (target) sequence(s) that have low E-values: CLOSE HOMOLOG (yes or no answer): • must be of similar length • must have recurring motifs • must have similar functions. Take another look at the subject (target) sequence(s) that have high E-values: DISTANT HOMOLOG: • are they of similar length? • do they have recurring motifs? • do they have similar functions? Verification: • Use target sequences as query sequences for another BLAST search • Does the original query sequence come up in the report? • PSI-BLAST
blast -The E-value is defined as the odds that the sequence alignment (pairing) of your query sequence with the database "hit" could have happened by chance. Why does it make sense that E = (m)(n)(2^-S), if (m) and (n) are the length of the sequence and size of the database used, respectively?
A smaller E means more confidence in the match. E gets larger with increasing length of the query sequence (m) and with increasing size of the database (n), because both enlarge the search space: there are more possible places to compare and so more chances for a match to turn up by chance, and this must be taken into the calculation. As the score S of the match increases, the E-value decreases exponentially. High E-values for short alignments make sense because shorter sequences have a higher probability of occurring in the database purely by chance. Doubling the length of either sequence should double the number of HSPs attaining a given score. The size of the search space is proportional to the product of the query sequence length (m) and the sum of the lengths of the sequences in the database (n): N = m*n.
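The two behaviours described above (linear in the search space, exponential in the score) can be checked numerically. This sketch simply evaluates E = m * n * 2^(-S) with S taken as a bit score; the function name `expect_value` and the example lengths are assumptions for illustration.

```python
def expect_value(m, n, bit_score):
    """E = m * n * 2**(-S): expected number of chance hits scoring >= S."""
    return m * n * 2.0 ** (-bit_score)

query_len = 100          # m: hypothetical query length
db_len = 1_000_000       # n: hypothetical total database length

base = expect_value(query_len, db_len, 40)
print(base)
print(expect_value(2 * query_len, db_len, 40) / base)  # search space doubled
print(expect_value(query_len, db_len, 41) / base)      # one extra bit of score
```

Doubling either length doubles E, while each additional bit of score halves it, which is why a modest gain in score outweighs a large increase in database size.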
Predictability
some regions easier to predict than others
FASTA format
FASTA stands for FAST-All, reflecting the fact that the program can be used for a fast protein comparison or a fast nucleotide comparison. Its speed is achieved by performing optimised searches for local alignments using a substitution matrix. The FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
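Since the format is just text, a few lines are enough to read it. In a FASTA file each record starts with a header line beginning with ">" and the sequence follows on one or more lines. The sketch below is a minimal reader; the function name `parse_fasta` and the two example records are made up for illustration, and a maintained library would be the better choice for real data.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:]               # header: name plus optional comment
            records[name] = []
        elif name is not None:
            records[name].append(line)    # sequence may span several lines
    return {header: "".join(parts) for header, parts in records.items()}

example = """>seq1 hypothetical protein
MKTAYIAK
QRQISFVK
>seq2 example DNA
ATGGCCTAA
"""
for header, seq in parse_fasta(example).items():
    print(header, len(seq))
```

Joining the wrapped lines matters, because the line breaks inside a record carry no biological meaning.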
Translational gene elements
Start codon, stop codon, Shine-Dalgarno (SD) sequence
PSORT -
system designed to make subcellular localization predictions of proteins
pI (isoelectric point):
the pH where the molecule carries no net electrical charge - shows a pattern in proteins that can tell us about the possible predicted location of a protein within a cell - 7.5-8.0 in the cytoplasm of E. coli