Genetics Ch. 24
What is a database? What types of information are stored within a database? Where does the information come from? Discuss the objectives of a genome database.
A database is a collection of many computer files in a single location. In genetics, these files usually contain DNA, RNA, or protein sequences. The data come from the contributions of many research labs. A major objective of a genome database is to organize the genetic information of a single species. A genome database will identify all of the known genes and indicate their map locations within the genome. In addition, a genome database may have other types of information, such as a concise description of a sequence, the name of the organism from which this sequence was obtained, and the function of the encoded protein, if it is known.
When comparing (i.e., aligning) two or more genetic sequences, it is sometimes necessary to put in gaps. Explain why. Discuss two changes (i.e., two types of mutations) that could happen during the evolution of homologous genes that would explain the occurrence of gaps in a multiple-sequence alignment.
A gap is necessary when two homologous sequences are not the same length. Because homologous sequences are derived from the same ancestral gene, two homologous sequences were originally the same length. However, during evolution, the sequences can incur deletions and/or additions that make the sequences shorter or longer than in the original ancestral gene. If one gene incurs a deletion, a gap will be necessary in this gene's sequence in order to align it with a homologous gene. If an addition occurs in a gene's sequence, a gap will be necessary in a homologous gene sequence in order to align the two sequences.
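To see how an alignment program places gaps, here is a minimal sketch of a global alignment (the Needleman-Wunsch method). The function name `global_align` and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not values taken from the chapter.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences and return both strings with '-' for gaps."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap          # leading gaps in a

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best score for aligning a[:i] with b[:j].
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


# A made-up gene and a copy that lost three bases (a deletion).
ancestral_like, deleted = global_align("ATGGCCTTA", "ATGTTA")
```

With these inputs the shorter sequence receives three gap characters, which is how the program represents the deletion. Several equally good placements of the gaps can exist, so the exact position may differ between programs.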
What is a motif? Why is it useful for computer programs to identify functional motifs within amino acid sequences?
A motif is a sequence that carries out a particular function. There are promoter motifs, enhancer motifs, and amino acid motifs that play functional roles in proteins. For a long genetic sequence, a computer can scan the sequence and identify motifs with great speed and accuracy. The identification of amino acid motifs helps a researcher to understand the function of a particular protein.
The goal of many computer programs is to identify sequence elements within a long segment of DNA. What is a sequence element? Give two examples. How is the specific sequence of a sequence element determined? In other words, is it determined by the computer program or by genetic studies? Explain.
A sequence element is a specialized sequence (i.e., a base sequence or an amino acid sequence) with a particular meaning or function. Two examples would be a stop codon (e.g., UAA), which is a base sequence element, and an amino acid sequence that is a site for protein glycosylation (i.e., asparagine-any amino acid-serine or threonine), which is an amino acid sequence element or motif. The computer program does not create these sequence elements. The program is given the information about sequence elements, which comes from genetic research. Scientists have conducted experiments to identify the sequence of bases that constitute a stop codon and the sequence of amino acids where proteins are glycosylated. Once this information is known from research, it can be incorporated into computer programs, and then the program can analyze new genetic sequences and identify the occurrence of stop codons and glycosylation sites.
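As a rough illustration, a program can be handed these two sequence elements and scan for them. The function names and test sequences below are made up for the example; the glycosylation pattern follows the description above (asparagine, any amino acid, then serine or threonine).

```python
import re

# The three stop codons, written as mRNA bases.
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Asn (N), any amino acid, then Ser (S) or Thr (T).
# The lookahead lets overlapping sites be reported.
GLYCOSYLATION_SITE = re.compile(r"(?=N[A-Z][ST])")


def find_stop_codons(mrna, frame=0):
    """Return 0-based positions of stop codons in the given reading frame."""
    return [i for i in range(frame, len(mrna) - 2, 3)
            if mrna[i:i + 3] in STOP_CODONS]


def find_glycosylation_sites(protein):
    """Return 0-based positions of the asparagine in each candidate site."""
    return [m.start() for m in GLYCOSYLATION_SITE.finditer(protein)]


stops = find_stop_codons("AUGGCCUAAGGGUGA")      # made-up mRNA
sites = find_glycosylation_sites("MKNATQNGSL")   # made-up polypeptide
```

The program only finds what it was told to look for; the element definitions themselves come from laboratory research.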
With regard to DNA microarrays, answer the following questions: A. What is attached to the slide? Be specific about the number of spots, the lengths of DNA fragments, and the origin of the DNA fragments. B. What is hybridized to the microarray? C. How is hybridization detected?
A. A DNA microarray is a small slide that is dotted with many different fragments of DNA. In some microarrays, DNA fragments, which were made synthetically (e.g., by PCR), are individually spotted onto the slide. The DNA fragments are typically 500 to 5,000 bp in length, and a few thousand to tens of thousands are spotted to make a single array. Alternatively, short oligonucleotides can be directly synthesized on the surface of the slide. In this process, the DNA sequence at a given spot is produced by selectively controlling the growth of the oligonucleotide using narrow beams of light. In this case, there can be hundreds of thousands of different spots on a single array. B. In most cases, fluorescently labeled cDNA is hybridized to the microarray, though labeled genomic DNA or RNA could also be used. C. After hybridization, the array is washed and placed in a scanning confocal fluorescence microscope that scans each pixel (the smallest element in a visual image). After correction for local background, the final fluorescence intensity for each spot is obtained by averaging across the pixels in each spot. This results in a group of fluorescent spots at defined locations in the microarray.
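The calculation in part C can be sketched as follows. This is a simplified assumption about how background correction might be done (subtract the mean of nearby background pixels, then average); real scanner software may use a different method. The function name and pixel values are hypothetical.

```python
def spot_intensity(spot_pixels, background_pixels):
    """Average the background-corrected fluorescence of the pixels in one spot."""
    background = sum(background_pixels) / len(background_pixels)
    # Assumption: corrected values are not allowed to drop below zero.
    corrected = [max(pixel - background, 0.0) for pixel in spot_pixels]
    return sum(corrected) / len(corrected)


# Made-up pixel readings for one spot and its surrounding background.
intensity = spot_intensity([110, 120, 130], [10, 10, 10])
```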
Refer to question 3 in More Genetic TIPS before answering this question. Based on the multiple-sequence alignment in Figure 24.10, what is/are the most probable time(s) that mutations occurred in the human globin gene family to produce the following amino acid differences? A. His-119 and Arg-119 B. Gly-121 and Pro-121 C. Glu-103, Val-103, and Ala-103
A. Because most family members contain a histidine, this is likely to be the ancestral codon. The histidine codon mutated into an arginine codon after the gene duplication occurred that produced the squiggly-globin gene. This would be after the emergence of primates or within the last 10 or 20 million years. B. We do not know if the ancestral globin gene had a glycine or proline at codon 121. The mutation probably occurred after the duplication that produced the α-globin family and β-globin family, but before the gene duplications that gave rise to the multiple copies of α- and β-globins on chromosome 16 and chromosome 11, respectively. Therefore, it occurred between 300 million and 200 million years ago. C. All of the β-globins contain glutamic acid at position 103, and all of the α-globins contain valine, except for θ-globin. We do not know if the ancestral globin gene had a valine or glutamic acid at codon 103. Nevertheless, a mutation, converting one to the other, probably occurred after the duplication that produced the α-globin family and the β-globin family, but before the gene duplications that gave rise to the multiple copies of α- and β-globins on chromosome 16 and chromosome 11, respectively. Therefore, it occurred between 300 million and 200 million years ago. The mutation that produced the alanine codon in the θ-globin gene probably occurred after the gene duplication that produced this gene. This would be after the emergence of mammals (i.e., sometime within the last 200 million years).
Take a look at the multiple-sequence alignment in Figure 24.10 of the globin polypeptides, focusing on amino acids 101 to 148. A. Which of these amino acids are likely to be most important for globin structure and function? Explain why. B. Which are likely to be least important?
A. The amino acids that are most conserved (i.e., the same in all of the family members) are most likely to be important for structure and/or function. This is because a mutation that changed the amino acid might disrupt structure and function, and these kinds of mutations would be selected against during evolution. Completely conserved amino acids are found at the following positions: 101, 102, 105, 107, 108, 116, 117, 124, 130, 134, 139, 143, and 147. B. The amino acids that are least conserved are probably not very important because changes in the amino acid do not seem to inhibit function. (If a change did inhibit function, natural selection would eliminate such a mutation.) At one location, position 118, there are five different amino acids.
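Finding conserved and variable positions is a simple column-by-column count. The sketch below uses a short made-up alignment, not the globin data from Figure 24.10; the function name and the `first_position` parameter are assumptions for illustration.

```python
def residues_per_column(aligned_seqs, first_position=1):
    """Map each alignment position to the number of different residues found there."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all be the same length")
    counts = {}
    for col in range(length):
        # A gap character ('-') is counted as its own symbol.
        counts[first_position + col] = len({seq[col] for seq in aligned_seqs})
    return counts


# Made-up alignment of three short polypeptides, numbered from position 101.
counts = residues_per_column(["MKVLA", "MRVLG", "MKVIA"], first_position=101)
conserved = [pos for pos, n in counts.items() if n == 1]
most_variable = max(counts, key=counts.get)
```

Positions with a count of 1 are completely conserved; positions with the highest counts are the least conserved.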
To identify the following types of genetic occurrences, would a computer program use sequence recognition, pattern recognition, or both? A. Whether a segment of Drosophila DNA contains a P element (which is a specific type of transposable element) B. Whether a segment of DNA contains a stop codon C. In a comparison of two DNA segments, whether there is an inversion in one segment compared with the other segment D. Whether a long segment of bacterial DNA contains one or more genes
A. To identify a specific transposable element, a program would use sequence recognition. The sequence of P elements is already known. The program would be supplied with this information and scan a sequence file looking for a match. B. To identify a stop codon, a program would use sequence recognition. There are three stop codons that are specific three-base sequences. The program would be supplied with these three sequences and scan a sequence file to identify a perfect match. C. To identify an inversion of any kind, a program would use pattern recognition. In this case, the program would be looking for a pattern in which the same sequence was running in opposite directions in a comparison of the two sequence files. D. A search by signal approach uses both sequence recognition and pattern recognition as a means to identify genes. It looks for an organization of sequence elements that would form a functional gene. A search by content approach identifies genes based on patterns, not on specific sequence elements. This approach looks for a pattern in which the nucleotide content is different from a random distribution. The third approach to identify a gene is to scan a genetic sequence for long open reading frames. This approach is a combination of sequence recognition and pattern recognition. The program is looking for specific sequence elements (i.e., stop codons) but it is also looking for a pattern in which the stop codons are far apart.
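Part C can be illustrated with a small sketch. An inverted DNA segment appears in the second file as the reverse complement of the segment in the first file, so the program looks for that pattern rather than for any particular known sequence. The function names, the window size, and the two test sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_inverted_windows(reference, query, window=6):
    """Return start positions in reference whose window occurs in query only in inverted form."""
    hits = []
    for i in range(len(reference) - window + 1):
        segment = reference[i:i + window]
        if segment not in query and reverse_complement(segment) in query:
            hits.append(i)
    return hits


# Made-up sequences: the middle six bases of the second are inverted.
hits = find_inverted_windows("ACACGGATTCCACA", "ACACGAATCCCACA")
```

This is only a toy version; it requires a perfect match and a fixed window size, which a real comparison program would not.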
Discuss why it is useful to search a database to identify sequences that are homologous to a newly determined sequence.
By searching a database, you can identify genetic sequences that are homologous to a newly determined sequence. In most cases, homologous sequences carry out identical or very similar functions. Therefore, if you find a homologous sequence in a database whose function is already understood, this provides an important clue regarding the function of the newly determined sequence.
Give the meanings of the following terms: genomics, functional genomics, and proteomics.
Genomics is the study of genome composition. Researchers attempt to map all of the genes in the genome and ultimately to determine the sequence of all the chromosomes. Functional genomics attempts to understand how genetic sequences function to produce the characteristics of cells and the traits of organisms. Much of functional genomics is aimed at an understanding of gene function. However, it also tries to understand the roles of other genetic sequences such as centromeres and repetitive sequences. Proteomics focuses on the functions of proteins. The ultimate goal is to understand how groups of proteins function as integrated units.
What is the difference between similarity and homology?
In genetics, the term similarity means that two genetic sequences are similar to each other. Homology means that two genetic sequences have evolved from a common ancestral sequence. Homologous sequences are similar to each other, but not all short similar sequences are due to homology.
Explain how tandem mass spectrometry is used to determine the sequence of a peptide. Once a peptide sequence is known, how is this information used to determine the sequence of the entire protein?
In tandem mass spectrometry, the first spectrometer determines the mass of a peptide fragment from a protein of interest. The second spectrometer determines the masses of progressively smaller fragments that are derived from that peptide. Because the mass of each amino acid is known, the differences in mass between these smaller fragments reveal the amino acid sequence of the peptide. With peptide sequence information, it is possible to use the genetic code and produce DNA sequences that could encode such a peptide. More than one sequence is possible, due to the degeneracy of the genetic code. These sequences are used as query sequences to search a genomic database, which will (hopefully) locate a match. The genomic sequence can then be analyzed to determine the entire coding sequence for the protein of interest.
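The idea of reading a sequence from fragment masses can be sketched as follows. The residue masses are approximate values for only five amino acids, and the fragment masses are made up; a real analysis uses all twenty amino acids, accounts for the type of fragment ion, and cannot separate leucine from isoleucine by mass alone. The function name and tolerance are assumptions.

```python
# Approximate residue masses in daltons (illustrative subset only).
RESIDUE_MASS = {"G": 57.021, "A": 71.037, "S": 87.032, "V": 99.068, "K": 128.095}


def read_ladder(fragment_masses, tolerance=0.01):
    """Infer residues from the mass differences between successive fragments."""
    ordered = sorted(fragment_masses)
    residues = []
    for smaller, larger in zip(ordered, ordered[1:]):
        difference = larger - smaller
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(mass - difference) <= tolerance]
        # '?' marks a step that does not match exactly one known residue.
        residues.append(matches[0] if len(matches) == 1 else "?")
    return "".join(residues)


# Made-up ladder: each fragment is one residue heavier than the previous one.
peptide = read_ladder([100.0, 157.021, 228.058, 356.153])
```

Whether the letters read from the amino end or the carboxyl end depends on which series of fragments was measured, which this sketch does not model.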
Discuss the bioinformatics approaches that can be used to identify a protein-encoding gene.
One strategy is search by signal, which relies on known sequences such as promoters, start and stop codons, and splice sites to help predict whether or not a DNA sequence contains a protein-encoding gene. This approach attempts to identify a region that contains a promoter sequence, then a start codon, a coding sequence, and a stop codon. A second strategy is search by content. This approach attempts to locate coding regions by identifying regions where the nucleotide content displays a bias. A third approach for locating protein-encoding genes is to search for long open reading frames within a DNA sequence. An open reading frame is a stretch of sequence that, when read in a particular reading frame, does not contain any stop codons.
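The third approach can be sketched in a few lines. This simplified version scans only the three forward reading frames, requires an ATG start codon, and uses an arbitrary minimum length; the function name, the `min_codons` cutoff, and the test sequence are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) positions of ATG-to-stop stretches on the forward strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this stretch
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end includes the stop codon
                start = None
    return orfs


# Made-up sequence with one short open reading frame.
orfs = find_orfs("CCATGGCTGGAAAATAAGG")
```

A real gene finder would also scan the complementary strand and use a much larger length cutoff, because short open reading frames occur often by chance.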
Besides the examples listed in Table 24.3, list five types of short sequences that a geneticist might want to locate within a DNA sequence.
Other types of short sequences are centromeric sequences, origins of replication, telomeric sequences, repetitive sequences, and enhancers. (Other examples are possible.)
Discuss the distinction between sequence recognition and pattern recognition.
Sequence recognition involves the recognition of a particular sequence of an already known function that has been supplied to the computer program. For example, a program could locate start codons (ATG) within a DNA sequence. By comparison, pattern recognition relies on a pattern of arrangement of symbols but is not restricted to particular sequences.
In this chapter, we considered a computer program that can translate a DNA sequence into a polypeptide sequence. A researcher has a sequence file that contains the amino acid sequence of a polypeptide and runs a program that is opposite to the program described in the chapter. This other program is called BACKTRANSLATE. It takes an amino acid sequence file and determines the sequence of DNA that would encode such a polypeptide. How does this program work? In other words, what genetic principles underlie this program? What type of sequence file would this program generate: a nucleotide sequence or an amino acid sequence? Would the BACKTRANSLATE program produce only a single sequence file? Explain why or why not.
The BACKTRANSLATE program works by using the genetic code. Each amino acid has one or more codons (i.e., three-base sequences) that are specified by the genetic code. This program would produce a sequence file that is a nucleotide base sequence. The BACKTRANSLATE program would produce a degenerate base sequence because the genetic code is degenerate. For example, lysine can be specified by AAA or AAG. The program would probably store a single file that had degeneracy at particular positions. For example, if the amino acid sequence was lysine-methionine-glycine-glutamine, the program would produce the following sequence: 5′–AA(A/G)ATGGG(T/C/A/G)CA(A/G)–3′. The bases found in parentheses are the possible bases due to the degeneracy of the genetic code.
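A program of this kind might look like the sketch below. It is not the actual BACKTRANSLATE program; the function name and the parenthesis notation are chosen to match the answer above. It uses the standard genetic code and one-letter amino acid abbreviations. One simplification to note: for amino acids with six codons (leucine, serine, arginine), listing the possible bases position by position allows some combinations that are not real codons for that amino acid.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def backtranslate(protein):
    """Return a degenerate DNA sequence that could encode the given polypeptide."""
    pieces = []
    for aa in protein:
        codons = [codon for codon, encoded in CODON_TABLE.items() if encoded == aa]
        if not codons:
            raise ValueError(f"unknown amino acid: {aa!r}")
        for position in range(3):
            # Collect every base seen at this codon position for this amino acid.
            options = [b for b in BASES if any(c[position] == b for c in codons)]
            pieces.append(options[0] if len(options) == 1
                          else "(" + "/".join(options) + ")")
    return "".join(pieces)


# Lysine-methionine-glycine-glutamine in one-letter code.
print(backtranslate("KMGQ"))  # prints AA(A/G)ATGGG(T/C/A/G)CA(A/G)
```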
In this chapter, we considered a computer program that translates a DNA sequence into a polypeptide sequence. Instead of running this program, a researcher could simply look the codons up in a genetic code table and determine the sequence by hand. What are the advantages of running the program rather than doing the translation the old-fashioned way, by hand?
The advantages of running a computer program are speed and accuracy. Once the program has been made, and a sequence file has been entered into a computer, the program can analyze long genetic sequences quickly and accurately.
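For comparison, a translation program can be very short. This is a minimal sketch using the standard genetic code, not the program described in the chapter; the function name, the `frame` parameter, and the choice to mark unrecognized codons with "X" are assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, frame=0):
    """Translate a DNA coding strand into one-letter amino acids, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' = unrecognized codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


polypeptide = translate("ATGGCTGGAAAATAA")  # made-up coding sequence
```

The same few lines handle a sequence of fifteen bases or fifteen thousand without the copying errors that are easy to make by hand.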
Describe the two general types of protein microarrays. What are their possible applications?
The two general types of protein microarrays are antibody microarrays and functional protein microarrays. In an antibody microarray, many different antibody molecules, each one recognizing a different peptide sequence, are spotted onto the array. Cellular proteins are isolated, fluorescently labeled, and exposed to the microarray. When a given protein is recognized by an antibody, it will be captured by the antibody and remain bound to the spot. Because each antibody recognizes a different peptide sequence, this microarray can be used to monitor protein expression levels. A functional protein microarray involves purifying cellular proteins and spotting them onto a slide. This type of microarray can be analyzed with regard to substrate specificity, drug binding, and/or protein-protein interactions.
In the procedure called RNA sequencing (RNA-Seq), what type of molecule is actually sequenced?
The type of molecule that is sequenced is cDNA.
A multiple-sequence alignment of five homologous proteins is shown here (see figure). Discuss some of the interesting features that this alignment reveals.
There are a few interesting trends. Sequences 1 and 2 are similar to each other, as are sequences 3 and 4. There are a few places where amino acid residues are conserved among all five sequences. These amino acids may be particularly important with regard to function.
Discuss the reasons why the proteome is larger than the genome of a given species.
There are two main reasons why the proteome is larger than the genome. The first reason involves the processing of pre-mRNA, a phenomenon that occurs primarily in eukaryotic species. RNA splicing and editing can alter the codon sequence of mRNA and thereby produce alternative forms of proteins that have different amino acid sequences. The second reason for protein diversity is posttranslational modifications. There are many ways that a given protein's structure can be covalently modified by cellular enzymes. These include proteolytic processing, disulfide bond formation, glycosylation, attachment of lipids, phosphorylation, methylation, and acetylation, to name a few.
Can two-dimensional gel electrophoresis be used as a purification technique? Explain.
Yes, two-dimensional gel electrophoresis can be used as a purification technique. A spot on a two-dimensional gel can be cut out, and the protein can be eluted from the spot. This purified protein can be subjected to tandem mass spectrometry to determine peptide sequences within the protein. It should be mentioned, however, that two-dimensional gel electrophoresis would not be used to purify proteins in a functional state. The exposure to SDS in the second dimension would denature proteins and probably inactivate their function.