BIO 331 Final
DNA replication
One double-stranded molecule is duplicated into two. Replication occurs during S phase, before mitosis and meiosis. DNA polymerase incorporates nucleotides that are complementary to those on the template strand. Synthesis proceeds in the 5'-3' direction, with nucleotides added to the 3' OH. Mistakes are sometimes made.
How has the field of bioinformatics changed over time?
1. The first protein sequence was determined in 1955, which laid the foundation for the creation of genomic data banks before the advent of the internet. 2. The discovery of the first nucleotide sequence in 1968 was followed by the development of the first sequence alignment algorithm by Needleman & Wunsch, setting the stage for later advancements. 3. In 1971, the Protein Data Bank (PDB) was established, creating a repository for 3D protein structures. Shortly after, the first protein structure prediction algorithm was developed, paving the way for later milestones such as the DNA sequencing of the first whole genome. 4. The Smith-Waterman algorithm for sequence alignment, the EMBL-ENA and GenBank databases, and the discovery of the Polymerase Chain Reaction (PCR) in 1986. 5. The advent of parallel and cloud computing has enabled the field to keep up with the demand for storing and analyzing BigData stemming from high-throughput Next Generation sequencing methodologies. 6. The invention of "BLAST".
illumina sequencing
1. prep genomic DNA sample 2. attach DNA to surface 3. bridge amplification 4. fragments become double stranded 5. denature the double stranded molecules 6. complete amplification 7. determine first base 8. image first base 9. determine second base 10. image second chemistry cycle 11. sequence reads over multiple chemistry cycles 12. align data
Translate the below sequence in all 6 reading frames. If/when you get to a stop codon, simply list the term 'stop' and proceed no further. 5' ACTGACTACGAAAGC 3'
1: Thr Asp Tyr Glu Ser 2: Leu Thr Thr Lys 3: Stop 4: 5' GCT-TTC-GTA-GTC-AGT 3' -> Ala-Phe-Val-Val-Ser 5: 5' CTT-TCG-TAG-TCA-GT 3' -> Leu-Ser-STOP 6: 5' TTT-CGT-AGT-CAG-T 3' -> Phe-Arg-Ser-Gln (frames 4-6 are read from the reverse complement, 5' GCTTTCGTAGTCAGT 3')
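The six-frame answer above can be checked with a short script. This is a minimal sketch, assuming the standard genetic code and one-letter amino acid codes (with * for stop); the function names are made up for illustration.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Complement each base, then reverse, so the result reads 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate_until_stop(seq):
    """Translate complete codons; record '*' at the first stop and go no further."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

def six_frames(seq):
    """Frames 1-3 come from the given strand, frames 4-6 from its reverse complement."""
    rc = reverse_complement(seq)
    return [translate_until_stop(s[offset:]) for s in (seq, rc) for offset in range(3)]

frames = six_frames("ACTGACTACGAAAGC")
for number, protein in enumerate(frames, start=1):
    print(number, protein)
```

In one-letter code, T D Y E S corresponds to Thr Asp Tyr Glu Ser, and so on for the other frames.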
Mendel's laws
1st: law of segregation: during gamete formation, the paired hereditary determinants segregate such that each gamete is equally likely to contain either one. 2nd: law of independent assortment: segregation of the members of a pair of alleles is independent of the segregation of other pairs in the formation of reproductive cells.
genome annotation workflow
The 2 main steps are structural genome annotation and functional genome annotation. Structural genome annotation is the process of identifying coding genes (intron-exon structures) and non-coding genes (e.g. tRNAs). Functional genome annotation is the process of attaching metadata to structural annotations (e.g. which product is encoded by the gene).
Enhancers
A DNA sequence that is recognized by certain transcription factors, which can then stimulate transcription of nearby genes.
What is a FastQ file? How can it be used to tell you information about the quality of your sequences?
A FastQ file is a text-based format for storing a biological sequence and the corresponding quality score for each base. It typically consists of 4 lines per sequence entry. Line 1: a header line that begins with @ followed by a unique read identifier. Line 2: the sequence line, which contains the raw sequence letters. Line 3: a separator line that typically contains a + symbol. Line 4: the quality score line, which encodes the quality values for the sequence in line 2 (it must contain the same number of symbols as line 2 has letters); ! = lowest quality and ~ = highest quality. FastQ files can be used for quality control, preprocessing, and for reconstructing a genome.
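The four-line layout and the quality symbols can be illustrated with a small parser. This is a minimal sketch, assuming the common Phred+33 encoding (where ! maps to 0 and ~ to 93); the read name and sequence are made up.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality string) from 4-line FastQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line beginning with '+'
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality

def phred_scores(quality, offset=33):
    """Convert each quality symbol to a numeric score (assumes Phred+33)."""
    return [ord(symbol) - offset for symbol in quality]

# Hypothetical single-read file held in memory.
example = io.StringIO("@read1\nACGT\n+\n!5I~\n")
records = list(read_fastq(example))
for name, seq, qual in records:
    scores = phred_scores(qual)
    print(name, seq, scores, sum(scores) / len(scores))
```

Averaging the scores per read, as in the last line, is one simple way to flag low-quality reads during quality control.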
Intron
A non-coding sequence/region that is situated within a gene and is not included in the final mature RNA molecule. After transcription, introns are 'spliced' out of the mRNA molecule and, hence, do not contribute to the final functional RNA product. Not all organisms have introns. Eukaryotic organisms (with a nucleus) tend to have introns in their genes whereas prokaryotic organisms tend not to.
Exon
A region that remains in an RNA after splicing/maturation. While some portions of exons are regulatory in nature (i.e. 5' and 3' UTRs), and hence do not code for proteins (or even remain in the final structures of ncRNA molecules), the majority of nucleotides within exons of protein coding genes are 'coding', comprising the ORF/CDS. Eukaryotes contain exons and introns. Technically, most prokaryotic genes contain only exons.
promoter
A regulatory element/sequence of DNA that proteins bind to when initiating transcription of a nearby gene. Promoters are typically located just upstream (on the 5' side) of the transcription initiation site. Promoters can be found in every gene in eukaryotes, but prokaryotic genes found within a single operon may have just one promoter for the entire operon. Promoters are much more complex in eukaryotes.
What is sequence alignment?
A sequence alignment is a way of arranging protein or DNA sequences to identify similarities that could explain evolutionary relationships or reveal homology (common ancestry). Alignments arrange nt's or AA's within columns in a way that tends to minimize the number of differences between the aligned sequences. Finding the optimal arrangement relies on the use of substitution matrices, generated through the study of prior, manual alignments, which set the rewards and penalties for matches and mismatches in an alignment.
Open reading frame (CDS or ORF)
A sequence of genetic material that can be read by cellular machinery to create protein through translation. It begins with a start codon (AUG in RNA) and ends with a stop codon (UAA, UAG, or UGA). Its length is a multiple of 3 nucleotides, since codons determine the amino acids. Present in all organisms.
point mutation
A single nucleotide change. Within an open reading frame (ORF/CDS), a point mutation is missense (non-synonymous) when the change in DNA sequence causes the codon to encode an amino acid different from the one originally encoded. It is a nonsense mutation when the alteration causes a stop codon to be encoded. It is a silent (synonymous) mutation when the change in DNA doesn't change the amino acid that is encoded.
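The three categories can be told apart by translating the codon before and after the change. This is a minimal sketch, assuming the standard genetic code and an original codon that encodes an amino acid (not a stop); the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_point_mutation(codon, position, new_base):
    """Classify a single-base change at position 0, 1, or 2 of a sense codon."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "silent"
    if after == "*":
        return "nonsense"
    return "missense"

# GAA encodes Glu; three different single-base changes give three outcomes.
print(classify_point_mutation("GAA", 2, "G"))  # GAG still encodes Glu
print(classify_point_mutation("GAA", 0, "T"))  # TAA is a stop codon
print(classify_point_mutation("GAA", 1, "T"))  # GTA encodes Val
```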
chromosome
A single thread-like structure made of tightly packed DNA and proteins that stores the organism's genetic material. Linear chromosomes in eukaryotes have a central point called the centromere, with histones helping to pack the long DNA molecules. Humans have 23 pairs of chromosomes. Not all organisms have chromosomes.
gene
A specific sequence of nucleotides in DNA or RNA that is transcribed to produce an RNA molecule. Genes can be protein coding or non-protein coding. A gene represents a full unit of information that controls a specific trait or characteristic. Humans have ~20,000 protein coding genes while some bacteria have only a few hundred. Found in all cells/genomes of all organisms. (hereditary determinants)
What is a substitution matrix and what is it used for?
A substitution matrix is comprised of rows and columns that show scores (+/- numerical values) that are applied for identities/similarities/differences between amino acids or identities/differences among nucleotides. This matrix quantifies the likelihood of 1 residue (amino acid or nucleotide) being replaced by another during evolution. It is used in sequence alignment algorithms commonly. By using this matrix, we can calculate alignment scores. In this method, all individual scores, plus gap and extension penalties are all added up to generate the total alignment score. This value is typically used to distinguish between optimal and suboptimal alignments.
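The way individual scores and gap penalties add up to a total alignment score can be sketched as follows. The scoring values here are assumptions for illustration only (+1 for identities, -1 for differences, -5 to open a gap, -1 to extend it), and the first gap position is charged the opening penalty; real matrices such as BLOSUM62 have a separate score for every residue pair.

```python
def substitution_score(a, b):
    """Hypothetical nucleotide matrix: +1 for identities, -1 for differences."""
    return 1 if a == b else -1

def alignment_score(row1, row2, gap_open=-5, gap_extend=-1):
    """Sum substitution scores plus gap opening and extension penalties."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    previous_gap = None  # which row held the gap in the previous column
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            current_gap = 1 if a == "-" else 2
            total += gap_extend if previous_gap == current_gap else gap_open
            previous_gap = current_gap
        else:
            total += substitution_score(a, b)
            previous_gap = None
    return total

# Five identities, one difference, and one 2-column gap.
score = alignment_score("ACGTTGCA", "AC--TGGA")
print(score)
```

Scoring several candidate arrangements of the same two sequences this way and keeping the highest total is what distinguishes the optimal alignment from suboptimal ones.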
codon
A trinucleotide DNA or RNA sequence that encodes a specific amino acid or acts as a signal that starts or stops protein synthesis. Each codon is representative of a single amino acid of a protein, forming a unit of genomic information. Codons are present in the cells/genomes of all organisms as they play a crucial role in the process of protein synthesis. During the translation process, the codon sequence determines the amino acid order in the protein.
describe the difference in structure between amino acids
All amino acids have a common structure that consists of an amino group, a carboxyl group, a hydrogen atom, and a side chain. The 20 types of amino acids differ from each other with respect to their side chains, also referred to as R groups. These R groups differ in polarity, electric charge, and structure. Nine of these amino acids (glycine, alanine, valine, leucine, isoleucine, methionine, phenylalanine, tryptophan, and proline) have a non-polar side chain. Six of these amino acids (serine, threonine, cysteine, tyrosine, asparagine, and glutamine) have a polar side chain. Five of these amino acids (aspartate, glutamate, lysine, arginine, and histidine) have electrically charged side chains. Aspartate and glutamate are the acidic amino acids, with negatively charged side chains, while lysine, arginine, and histidine are the basic amino acids, with positively charged side chains.
BLAST searches that could use the BLOSUM62 matrix
BLASTp, BLASTx, and tBLASTn
What is DNA made out of? How are two strands of DNA held together in a cell?
DNA is made up of nucleotides, which contain a sugar, a phosphate, and a base. It is organized into chromosomes and exists as a double-stranded helix (double helix), with 2 chains held together by hydrogen bonds between complementary bases. These hydrogen-bonded pairs are the Watson-Crick base pairs, which allow DNA to be double stranded. DNA is also made up of the elements C, H, O, N, and P.
what is DNA made up of and where is it found
DNA is made up of nucleotides, which are composed of a sugar, a phosphate, and a nitrogenous base (purine or pyrimidine); the bases differ in structure. Complementary DNA bases are held together by H bonds and form a double-stranded helix. Strands of DNA are then organized and packed into chromosomes. In eukaryotes, DNA is found in the nucleus, and nuclear DNA does not leave the nucleus at any time. It is also found in the mitochondria of eukaryotes (and in the chloroplasts of organisms like plants). Prokaryotes do not have a nucleus; their DNA is found in the nucleoid.
What is dbSNP? Describe the type of data made available here.
The Database for Single Nucleotide Polymorphisms and Other Classes of Minor Genetic Variation. It is a public-domain archive containing a large collection of simple genetic polymorphisms.
What is the following equation used for E = Kmn e^(-λS) ? Define its variables.
The equation is used to calculate the E value. The E value is the number of alignments with scores greater than or equal to S that are expected to occur by chance in a database search. It is derived from a description of the extreme value distribution. S = score; E = expected value = number of high-scoring segment pairs expected to occur with a score of at least S; m = length of the query sequence; n = size of the database being searched; λ = scaling parameter for the scoring system and K = scaling constant for the search space (both from Karlin-Altschul statistics). Very high scores correspond to very low E values.
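The relationship between score and E value can be seen by plugging numbers into the equation. This is a minimal sketch; the K and λ values below are illustrative placeholders, not parameters taken from any real search.

```python
import math

def expected_value(score, query_length, database_size, k, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return k * query_length * database_size * math.exp(-lam * score)

# Same hypothetical query and database, two different alignment scores.
e_low_score = expected_value(score=40, query_length=250, database_size=1_000_000, k=0.1, lam=0.3)
e_high_score = expected_value(score=80, query_length=250, database_size=1_000_000, k=0.1, lam=0.3)
print(e_low_score, e_high_score)
```

Because S sits in a negative exponent, raising the score shrinks E exponentially, while a longer query or a larger database raises E only linearly.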
transitions and transversion
Descriptors for point mutations occurring anywhere within the genome. Transition: a pyrimidine (cytosine or thymine) is replaced with the other pyrimidine, or a purine (adenine or guanine) is replaced with the other purine. Transversion: a pyrimidine is replaced with a purine or vice versa. Transitions are more common than transversions.
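The two definitions reduce to a check on whether both bases belong to the same chemical class. A minimal sketch for DNA bases, with a hypothetical function name:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Return 'transition' if both bases are in the same class, else 'transversion'."""
    if ref == alt:
        raise ValueError("not a substitution")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

for ref, alt in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
    print(ref, alt, classify_substitution(ref, alt))
```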
structure of the SARS-CoV-2 genome
SARS-CoV-2 is an RNA virus with a positive-sense, single-stranded genomic RNA. Its structural proteins include spike, envelope, membrane, and nucleocapsid; the spike protein has distinct functional domains. The envelope is a lipid bilayer membrane with a matrix protein below it that forms a shell (giving rigidity and strength to the lipid membrane). The RNA genome, the genetic material of the virus, is located inside the particle. The spike protein is the structural protein that gives coronavirus particles their crown-like shape.
Describe just why it is that each run of Illumina sequencing requires such a large amount of computer storage space.
Illumina sequencing is completed on a flow cell containing eight lanes. Each lane contains two columns and each column has at most 50 tiles. A single run outputs 2.9 terabytes worth of images. FastQ is a more useful way to store the data b/c it includes information on quality and the position of the flow cell lane where the sequence was generated, etc. FastA seqs can be generated from FastQ files after quality control steps have been taken. But in either case, both are text formatted files - so they take up way less space than image files.
What is Illumina sequencing?
Illumina sequencing is one of the two common NextGen technologies that use sequencing by synthesis. It involves sample prep (fragmenting the genome to pieces ~500 bp in length, selecting these fragments, and ligating adapters on the ends); loading into a flow cell; generating 'clusters' (clonally amplified pieces of the genome, each found in their own coordinate on the flow cell) through bridge PCR; and then sequencing through the addition of fluorescently labeled reversible terminators (i.e. a type of nucleotide) to infer the identities of nucleotides added to the strands synthesized in real time. At the end of an Illumina sequencing run, parts of the adapters are sequenced to aid in "de-multiplexing".
ab initio discovery of protein coding genes and what approaches such tools use to detect genes in eukaryotes
In eukaryotes: protein coding sequences are divided into several parts (exons) separated by non-coding sequences (introns), and the exons can be in different frames. Gene prediction tools incorporate models of gene structure that take into account exon-intron boundaries, splice site signals, and other features specific to eukaryotic gene organization. Tools used: GeneMark (employs a self-training algorithm) and Augustus (uses a combination of Markov models and statistical algorithms to predict genes based on features).
ab initio discovery of protein coding genes and what approaches such tools use to detect genes in prokaryotes
In prokaryotes: tools look for START and STOP codons, since the sequence coding for a protein occurs as one continuous open reading frame, typically many hundreds or thousands of bp long. Tools used: GLIMMER (finding genes in microbial DNA), GeneMarkS (self-training method for prediction of genes), and Prodigal (microbial gene finding program).
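The core idea of scanning for a START codon followed by an in-frame STOP codon can be sketched in a few lines. This is a deliberately simplified illustration (forward strand only, ATG as the only start codon, a made-up test sequence) and not how GLIMMER, GeneMarkS, or Prodigal are actually implemented; those tools also model sequence composition statistically.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                # Length is counted in codons, including the start and stop codons.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

orfs = find_orfs("CCATGGCTTAAGGATGTGA")
print(orfs)
```

In practice `min_codons` would be set far higher, since real prokaryotic ORFs are typically hundreds of bp long and short ATG...stop runs occur by chance.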
which mutations are most likely to impact phenotype
Missense and nonsense mutations, due to the changes in what is encoded by the codon. Frameshift mutations are also likely to change phenotype because the reading frame is altered in a way that changes the amino acids being encoded. Mutations that occur within regulatory regions can also impact phenotype, because they may cause genes to be expressed more or less than they originally were.
what are mutations
A mutation is an alteration in the genome resulting in an altered nucleotide sequence, which may cause a change in phenotype depending on the type and location of the mutation.
What is the difference between a mutation, a polymorphism, and a substitution?
Mutations are changes in DNA that occur either during replication or between cycles of replication (e.g. spontaneous deamination of C; formation of T-dimers). Alleles are different sequences at the same locus that exist due to mutation; a polymorphism is a locus at which two or more alleles are present in a population. Alleles are not exclusive to genes. Substitution = a mutation that has gone to 100% frequency, i.e. an allele that has become fixed in a population and is therefore present in all its members.
How and when do mutations arise?
Mutations occur during DNA replication if the DNA polymerase attaches a nucleotide that is not complementary to the nucleotide of the original strand. They can also occur due to spontaneous changes in the DNA, which can happen during replication or when replication is not occurring. Ectopic recombination events and transposition events can cause mutations. The addition of DNA to a genome via lateral gene transfer is also a type of mutation.
What is Next Generation sequencing and how has it changed the field of bioinformatics?
Next generation sequencing is a collection of methods to rapidly sequence DNA or RNA. Not only are they able to sequence an incredible amount of genetic information in a short period of time, but they are also relatively cost efficient (i.e. compared to previously leaned-upon Sanger sequencing). These technologies have changed the field of bioinformatics by allowing us to sequence human genomes in a matter of hours or days, and for under $1000. The influx of new genome data coming from this technology has improved the field of biology and medical science.
What features differentiate the different nucleotides? And how are the individual nucleotides linked together on a single strand?
Nucleotides can be either a purine or a pyrimidine. Purines (A and G bases) have a double-ring structure (2 carbon-nitrogen rings), whereas pyrimidines (C, T, and U bases) have a single-ring structure (1 carbon-nitrogen ring). Individual nucleotides are linked together by a phosphodiester bond, which is a covalent bond formed between the 5' phosphate group of one nucleotide and the 3'-OH group of another.
three general software/tools to support these pillars of reproducibility
Open Science Framework (OSF), Jupyter notebooks, and GitHub. OSF is a platform for managing and sharing research projects which allows researchers to collaborate, organize, and document their work in a transparent and accessible way
how to tell if 2 sequences are likely to be homologous?
Percentage sequence identity provides a basic way of quantifying an alignment. For a given alignment, such as the one above, the number of identical nucleotides can be determined. Then, the alignment can be scored based on the percentage identity across the entire sequence length. When a percentage sequence identity exceeds 25%, it can be expected that these sequences may be homologous.
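Percentage sequence identity can be computed directly from two aligned rows. This is a minimal sketch, assuming gaps are written as "-" and that end-gap columns are left out of the calculation; the aligned sequences are made up.

```python
def percent_identity(row1, row2):
    """Percent of identical columns, ignoring columns that are end gaps."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(row1, row2))
    # Trim leading and trailing columns in which either sequence has a gap.
    while columns and "-" in columns[0]:
        columns.pop(0)
    while columns and "-" in columns[-1]:
        columns.pop()
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)

# 5 identical columns out of the 6 that remain after trimming end gaps.
identity = percent_identity("--ACGTACGT", "TTACGAAC--")
print(round(identity, 1))
```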
Types of mutations
Point, insertion/deletion, duplication, transposition, inversion, translocation
what are proteins made of and where are they found
Proteins are made of amino acids. They are built from an mRNA (messenger RNA) template through a process called translation, which proceeds from the start codon to the stop codon by successive addition of amino acids. Proteins can be found in most parts of the cell, including the Golgi apparatus (a eukaryotic organelle, which sends packaged proteins where needed), the endoplasmic reticulum, the cytoplasm, and (for eukaryotes) the nucleus. In prokaryotes, proteins may be found in the outer membrane and in the cytoplasm.
what is RNA made up of and where is it found
RNA is made up of nucleotides and is produced by transcribing genes, which are regions of DNA that encode something, such as an RNA or a protein. RNA polymerase reads DNA and incorporates nucleotides that complement the template DNA to create an RNA strand. In mRNA, codons (nucleotide triplets) then determine the information used to synthesize proteins. RNA can be found in the nucleus (for eukaryotes), ribosomes, and cytoplasm of a cell. In eukaryotes, mRNAs leave the nucleus before the translation portion of protein synthesis.
how do the synthesized molecules shape the phenotype
Regulation of gene expression allows tissues with various functions to arise from the same DNA template. Differences between individuals arise from the slight variation in the genome that influences either the synthesized protein products directly or the regulators of those products.
What is scientific reproducibility? How is it different from replicability?
Scientific reproducibility is the ability of researchers to reproduce study results using the same methods, data, and procedures as the original study. It is essential because it ensures that research findings can be verified independently, which increases their reliability. Replicability is the ability to achieve similar results to a study while using different methods, data, and/or experimental conditions. The goal is to see if the same results can be achieved under different circumstances.
Describe the field of structural biology. Give two examples of structural biology as discussed in class.
Structural biology studies molecular structures and the dynamics of biological macromolecules, such as proteins and nucleic acids. It also focuses on the changes in their structures and how said changes can affect function. This field mixes the fundamentals of biophysics, molecular biology and biochemistry. Ex: SARS-CoV-2 and evaluation of RNA secondary structure prediction
genome
The sum of the genetic information an organism contains; it is made of DNA or, for some viruses, RNA. The human genome contains 3 billion nucleotide base pairs. All organisms have a genome.
reproducibility crisis
The concern that many scientific studies lack reproducibility. Many workshops have tried to reproduce studies and have failed due to missing data, software, and documentation. The consequences of irreproducible and unreliable research include misleading the community, wasting research funds, slowing scientific progress, and tarnishing the reputation of associated institutions and colleagues. In a clinical research setting, irreproducibility has the potential to risk patient safety.
output in unix: cat kamala1.txt kamala2.txt | wc -l
The first process, to the left of the pipe, opens the two files "kamala1.txt" and "kamala2.txt" and concatenates their contents. The pipe connects the two processes by using the output of the first process as the input of the second. The second process, to the right of the pipe, counts the lines in that combined input and outputs a single value: the total number of lines in the two files.
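The same total can be reproduced outside the shell, which makes it clear that one combined count is produced rather than one count per file. This is a minimal Python sketch; the file contents are made up and written to a temporary directory for the demonstration.

```python
import os
import tempfile

def count_lines(paths):
    """Total number of newline characters across files, like `cat files | wc -l`."""
    total = 0
    for path in paths:
        with open(path, "rb") as handle:
            total += handle.read().count(b"\n")
    return total

# Hypothetical contents: two lines in the first file, one in the second.
with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "kamala1.txt")
    second = os.path.join(folder, "kamala2.txt")
    with open(first, "w") as handle:
        handle.write("line one\nline two\n")
    with open(second, "w") as handle:
        handle.write("line three\n")
    total_lines = count_lines([first, second])
print(total_lines)
```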
the genetic code
The genetic code contains the full set of relationships between codons and amino acids. It is redundant and nearly universal. The genetic code is significant for all living organisms because it contains all the information needed to create protein. It provides a molecular explanation for the transmission of information from DNA to mRNA to protein.
how is information in the genome decoded to make functional molecules
The information in the genome is decoded to make functional molecules through a process called gene expression. This concept involves the transcription of DNA to RNA and the translation of RNA to proteins. These two processes are highly regulated and play a crucial role in determining the characteristics and functions of an organism. Transcription occurs in the cell nucleus, where the DNA sequence is transcribed into messenger RNA. This mRNA then travels to the cytoplasm, where translation takes place with ribosomes. During translation, the information in the mRNA is used to assemble a specific sequence of amino acids, building a functional protein. The regulation of gene expression ensures that the right genes are activated at the right time and in the right cells, allowing for the precise control of cellular activities.
amino acids (building blocks)
The order in which amino acids are strung together to form proteins is determined by the central dogma of biology; DNA is transcribed to mRNA which is translated to proteins. When looking at eukaryotic cells, DNA in the nucleus serves as a template for the synthesis of mRNA. RNA polymerase reads this DNA sequence and synthesizes a complementary mRNA strand. This mRNA strand then travels to the ribosomes, located in the cytoplasm, where it can undergo translation. During translation, the ribosome reads the mRNA codons, and the transfer RNA (tRNA) molecules that contain an anticodon complementary to the mRNA codon bind through base pairing. These tRNAs also carry a specific amino acid, such that when the tRNA anticodon binds to the mRNA codon the encoded amino acid will be placed at the carboxyl end of the protein, and a covalent bond will then link the amino acid to the growing protein chain.
Name tools (i.e. specific program names) that could be used for the ab initio discovery of protein coding genes
There are 2 main types of data used to find genes: ab initio prediction and sequence similarity. Sequence similarity looks at similarity between the genome and known mRNAs, protein products, and homologous sequences. Local programs such as BLAST, BLAT, FASTA, and Smith-Waterman look for regions of similarity between the target sequence and possible candidate matches.
describe de-multiplexing steps/parts
These parts are called "barcode/index sequences" or "barcodes/indices", and are unique for each sample containing genomic DNA loaded into the same flow cell lane. Such multiplexing saves $ and prevents generation of too much data for one sample. At the de-multiplexing step the barcodes associated with each sequence read (i.e. genomic fragment) will be matched against the list of barcodes assigned to each sample, so that software can then assign sequences to their respective "piles" so they can be analyzed and studied.
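The matching step can be sketched as a dictionary lookup from barcode to sample. This is a minimal illustration with made-up barcodes, sample names, and reads, using exact matching only; real de-multiplexing software typically also tolerates sequencing errors in the barcode.

```python
from collections import defaultdict

def demultiplex(reads, sample_barcodes):
    """Sort (barcode, sequence) reads into per-sample piles by exact barcode match."""
    piles = defaultdict(list)
    for barcode, sequence in reads:
        sample = sample_barcodes.get(barcode, "undetermined")
        piles[sample].append(sequence)
    return dict(piles)

# Hypothetical barcode sheet and reads from one flow cell lane.
sample_barcodes = {"ACGTAC": "sample_A", "TTGCAA": "sample_B"}
reads = [
    ("ACGTAC", "GATTACA"),
    ("TTGCAA", "CCGGTTA"),
    ("ACGTAC", "TTTTGGG"),
    ("NNNNNN", "ACACACA"),
]
piles = demultiplex(reads, sample_barcodes)
print(piles)
```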
Name the codon position at which it is least likely to change the encoded protein (i.e. 1st, 2nd, or 3rd)?
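The third position would be least likely to change the encoded protein because it is the wobble position: the genetic code is redundant, and many codons that differ only at the third position encode the same amino acid.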
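This can be checked by trying every single-base change at each codon position and counting how often the amino acid stays the same. A minimal sketch, assuming the standard genetic code and considering only the 61 codons that encode amino acids:

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1, 2) that keep the amino acid."""
    same = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # consider sense codons only
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODON_TABLE[mutant] == aa
    return same / total

fractions = [synonymous_fraction(p) for p in range(3)]
print(fractions)
```

The printed list gives the silent fraction for the 1st, 2nd, and 3rd positions in that order, with the third position clearly the largest.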
output in unix: grep "apple" fruitlist.txt > apples-to-apples.txt
This command searches for all lines containing the word "apple" in the file "fruitlist.txt" and then directs that those lines be exported to a new file called "apples-to-apples.txt".
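For comparison, the same filter-and-redirect behaviour can be written in Python. This is a minimal sketch with made-up file contents in a temporary directory; like grep with a plain pattern, it matches any line containing the string, so "pineapple" is kept too, and like ">" it overwrites the output file.

```python
import os
import tempfile

def grep_to_file(pattern, source_path, output_path):
    """Copy every line containing `pattern` to output_path, overwriting it."""
    with open(source_path) as source, open(output_path, "w") as output:
        for line in source:
            if pattern in line:
                output.write(line)

with tempfile.TemporaryDirectory() as folder:
    source_path = os.path.join(folder, "fruitlist.txt")
    output_path = os.path.join(folder, "apples-to-apples.txt")
    with open(source_path, "w") as handle:
        handle.write("apple\nbanana\npineapple\ncherry\n")  # hypothetical contents
    grep_to_file("apple", source_path, output_path)
    with open(output_path) as handle:
        matched = handle.read().splitlines()
print(matched)
```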
non-coding DNA
This region of DNA does not code for proteins. With the exception of ncRNA genes, most non-coding DNA is not transcribed. It may contain regulatory elements. Non-coding DNA is found in all organisms, but can comprise over 95% of the genome in eukaryotes with large genomes. In prokaryotes, non-coding DNA is much rarer and makes up a much smaller percentage of the genome.
how can you further verify that 2 sequences are aligned?
To further verify if these sequences are indeed homologous, more nucleotides upstream and downstream of this particular sequence should be examined to see if the percentage sequence identity is still greater than 25% over a larger range. Then, a substitution matrix should be used to generate a total alignment score, which will take into account individual scores, gap penalties and extension penalties.
Write out the Unix commands and arguments you would use to create a new directory in the directory you are working in. Call the new directory 'jane-seqs'. Now, what command would you use to navigate to this new directory?
To make a new directory: mkdir jane-seqs To navigate to this directory: cd jane-seqs
How do different types of alignment work and what are they trying to do?
We can quantify/score an alignment by calculating the percentage sequence identity. The gaps at the ends of the sequence are not counted when summing up the percent identity. They may represent a lack of collected sequence data. There are a few possible penalties that could occur when calculating alignment scores. These include gap penalties (a pre-specified reduction in the alignment score applied upon introduction of gaps into an alignment) and gap extension penalties (for extending a gap once it's been opened). Global alignment tries to align the entire length of the sequences in an attempt to maximize the overall similarity between them. Local alignment identifies regions of similarity between sequences and focuses mainly on finding the most significant regions of similarity.
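The difference between global and local alignment shows up clearly in the dynamic programming that computes the best score. This is a minimal score-only sketch in the style of Needleman-Wunsch (global) and Smith-Waterman (local); the scoring values are assumptions, and a single linear gap penalty is used instead of separate opening and extension penalties to keep it short.

```python
def best_alignment_score(seq1, seq2, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score: global if local is False, otherwise local."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment must pay for gaps at the start of either sequence.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score = max(table[i - 1][j - 1] + pair,
                        table[i - 1][j] + gap,
                        table[i][j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            table[i][j] = score
    return best if local else table[-1][-1]

# A shared GATTACA core surrounded by unrelated flanking sequence.
print(best_alignment_score("TTTTGATTACATTTT", "CCGATTACACC", local=True))
print(best_alignment_score("TTTTGATTACATTTT", "CCGATTACACC", local=False))
```

The local score reflects only the shared core, while the global score is pulled down by having to account for the unrelated flanks.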
Why do we often not penalize end gaps in alignments? And why might a gap extension penalty be different from a gap opening penalty?
We often do not penalize end gaps in alignments because they might be there due to a lack of data. A gap extension penalty applies when a gap that is already open is made longer. It is usually lower than the gap opening penalty because indels are often longer than 1 nucleotide, so one long gap is more plausible than several separate ones. Gap opening penalties are higher than point mutation (mismatch) penalties because indels are less likely than point mutations to occur in regions of importance.
translocation
a large piece of DNA breaks off and moves to a new place in the genome
locus
a specific physical site of a particular gene or non-coding DNA segment located on a region of chromosome. Loci are present in the cells/genomes of all organisms.
What is the subdiscipline of bioimage analysis?
Bioimage analysis helps analyze large amounts of image data, reproducibly extract quantitative information from images, and quantify the form and structure of cells and organisms. It is a sub-discipline of computational biology.
Define bioinformatics.
Bioinformatics studies biological questions by analyzing molecular data. It is an interdisciplinary field that combines informatics and computer science, statistics and mathematics, chemistry and biochemistry, pharmacy and medicine, and biology, molecular biology, and genetics. Bioinformaticians develop and implement new algorithms and software; develop, curate, and maintain databases; work with applications of tools and databases; generate biological knowledge; and integrate existing tools into pipelines or workflows for high automation. These tools can be applied in structural, sequence, and function analysis as well as software development and database construction and curation. i. Structure analysis: nucleic acid and protein structure prediction, classification, and comparison. ii. Sequence analysis: genome comparison, phylogeny, gene and promoter prediction, motif discovery, sequence database searching, and sequence alignment. iii. Function analysis: metabolic pathway modeling, gene expression profiling, protein interaction prediction, protein subcellular localization prediction.
In NCBI's nucleotide database, how many results do you get when you search for all 28S rRNA genes between 700- 1000 bp in length from: Nematodes? Green algae? (note, you will need to save these results) Ascomycete fungi?
a. Nematodes? Using the search query "28S rRNA AND 700:1000[SLEN] AND Nematoda[organism]" produced 8,498 results. b. Green algae? (note, you will need to save these results) Using the search query "28S rRNA AND 700:1000[SLEN] AND Chlorophyta[organism]" produced 706 results. c. Ascomycete fungi? Using the search query "28S rRNA AND 700:1000[SLEN] AND Ascomycota[organism]" produced 38,334 results.
Insertion or deletion
aka indel mutation: when nucleotides are removed or added. If this mutation occurs within an ORF, it will cause a frameshift if the indel is not a multiple of 3 bp. An indel that is a multiple of 3 bp keeps the reading frame, and leaves the neighboring codons unchanged only if it falls between existing codons. Frameshifts drastically alter the reading frame and hence the encoded protein.
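The effect of indel length on the reading frame can be seen by splitting a sequence into codons before and after a deletion. This is a minimal sketch using a made-up ORF.

```python
def codons(seq):
    """Split a sequence into complete codons, dropping any leftover bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

orf = "ATGGCTGACTACGAATAA"  # hypothetical ORF: ATG GCT GAC TAC GAA TAA

# Deleting 1 bp shifts every downstream codon (frameshift).
one_base_deleted = orf[:4] + orf[5:]
# Deleting 3 bp between codons removes one codon and keeps the frame.
three_bases_deleted = orf[:3] + orf[6:]

print(codons(orf))
print(codons(one_base_deleted))
print(codons(three_bases_deleted))
```

After the 1 bp deletion the original stop codon is no longer read in frame, which is why frameshifts tend to change the protein so drastically.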
%GC content
average GC-content in human genomes ranges from 35% to 60% across 100-Kb fragments, with a mean of 41%
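GC content over fixed-size fragments can be computed as follows. This is a minimal sketch with hypothetical function names; the tiny window in the example stands in for the 100-Kb fragments mentioned above.

```python
def gc_percent(seq):
    """Percent of A/C/G/T bases that are G or C (other symbols such as N are ignored)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return 100.0 * sum(base in "GC" for base in counted) / len(counted)

def windowed_gc(seq, window):
    """GC percent for consecutive, non-overlapping windows of the given size."""
    return [gc_percent(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]

print(gc_percent("ACGT"))
print(windowed_gc("GGCCAATT", 4))
```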
NCBI search query terms
Boolean operators: AND, OR, NOT; field tags: [ORGN] or [organism], sequence length [SLEN], [GENE]; ranges of accession numbers
2 of the nodes on Drexel's Picotte HPC cluster are _____ nodes. Most of its remaining nodes are used as ____ nodes.
cluster, compute
genotype
combination of alleles within an individual
BigData
datasets that are too large or complex for standard means of processing to handle. analyzed by bioinformaticians.
what do bioinformaticians do?
develop and implement new algorithms and software; database development, curation, and maintenance; application of tools and databases; integration of existing tools into pipelines for high-throughput automation; generate biological knowledge.
allele
different versions of a single gene
output in unix: less kamala2.txt
display the contents of a file called "kamala2.txt" on the screen
Why do some changes between amino acids get rewarded (i.e. not penalized) in various protein substitution matrices?
give penalties/rewards based on these frequencies- they get rewarded because they occur often (and are usually changes among chemically similar AAs)
HPC
high-performance computing (cluster)
low complexity filtering
identify and remove regions of low complexity from nucleotide or protein sequences. Low complexity regions are sequences that contain repetitive or simple patterns, such as long stretches of a single nucleotide or amino acid, tandem repeats, or sequences with biased compositions.
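A toy Python sketch of the idea, using Shannon entropy in a sliding window as the complexity measure. This is a simplification chosen for illustration, not the algorithm BLAST uses (BLAST relies on DUST for nucleotides and SEG for proteins); the window size and entropy cutoff are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the character composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.0):
    """Replace every window whose entropy is below min_entropy with N."""
    seq = seq.upper()
    masked = list(seq)
    for start in range(0, len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "N"
    return "".join(masked)


# Hypothetical sequence with a poly-A run between two mixed regions
example = "ACGTACGTACGT" + "A" * 12 + "ACGTACGTACGT"
print(mask_low_complexity(example))
```

The poly-A run (plus a few flanking bases, because the windows overlap) is replaced by N while the mixed regions survive.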
DNA damage
is defined as any modification of DNA that changes its coding properties or normal function in transcription or replication
an additional node on picotte is used as the ____ scheduler
job
illumina sequencing adaptors
lab synthesized DNA; can ligate to genomic fragments
like most HPC clusters, Picotte uses ____ as its operating system
linux
pillars of reproducibility
literate programming, code version control and sharing, compute environment control, persistent data sharing, and documentation
output in unix: rm myfile
myfile gets removed from the system
each node consists of multiple cores (parts of the CPU that receive instructions and perform computations based on those instructions). the nodes are connected with a fast ____ _____ to allow communication between them.
network cable
OMIM
Online Mendelian Inheritance in Man: comprehensive database of human genes and genetic phenotypes
phenotype
physical characteristics of individuals
transposition
piece of DNA is moved to another part of the genome by a 'selfish' piece of DNA that has evolved to make copies of itself
inversion
piece of DNA is placed into chromosome in the opposite orientation than it is supposed to be in
what types of molecules are synthesized at the direction of a genome and what do they do?
proteins and RNAs. Proteins: carry out the structure and function of cells, generate movement, provide support and shape, control gene expression, act as signaling molecules to regulate physiological processes, and catalyze chemical reactions within cells. RNAs: mRNAs are the templates for these proteins; ribosomal RNAs and transfer RNAs work during translation and other cellular processes; micro RNAs regulate gene expression by binding to mRNA and inhibiting translation; long non-coding RNAs also regulate gene expression and work on chromatin modification. The last two are examples of non-coding RNAs, which are also products of the genome.
index/barcode
sequences forming part of the ligated adapters; allow you to multiplex samples
another name for a node
server
recombination
shuffling of genetic material mediated by crossing over between chromosomes
duplication
a segment of DNA is copied more times than it is supposed to be
What are "split reads" in an RNAseq dataset, what causes them, and in what organisms would you see them?
split reads in the context of RNAseq = reads that map back to more than one region of the genome. They arise because introns are present in the genome but not in the mRNA transcriptome (they are spliced out). A read spanning two adjacent exons in the mRNA will therefore map back to 2 non-contiguous sections of the chromosome, interrupted by an intron. This happens in many eukaryotic genomes, which often have introns in their genes.
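A small Python sketch of why a read from spliced mRNA can land on two separate genomic segments. The exon coordinates are hypothetical, the gene is assumed to be on the forward strand, and `map_read_to_genome` is an illustrative helper, not a real aligner.

```python
def map_read_to_genome(exons, read_start, read_len):
    """Map a read given in spliced-transcript coordinates back to the genome.

    exons: sorted list of (genomic_start, genomic_end), half-open intervals.
    read_start: 0-based offset of the read within the spliced transcript.
    Returns the list of genomic (start, end) segments the read covers.
    """
    segments = []
    offset = 0  # transcript coordinate where the current exon begins
    read_end = read_start + read_len
    for g_start, g_end in exons:
        exon_len = g_end - g_start
        lo = max(read_start, offset)
        hi = min(read_end, offset + exon_len)
        if lo < hi:  # the read overlaps this exon
            segments.append((g_start + lo - offset, g_start + hi - offset))
        offset += exon_len
    return segments


# Hypothetical gene: two 100 bp exons separated by a 300 bp intron
exons = [(100, 200), (500, 600)]

print(map_read_to_genome(exons, 20, 50))  # [(120, 170)]: read within one exon
print(map_read_to_genome(exons, 80, 50))  # [(180, 200), (500, 530)]: split read
```

The second read crosses the exon-exon junction in the mRNA, so its two halves map 300 bp apart on the chromosome.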
output in unix: cat kamala1.txt kamala2.txt > kamalacombo.txt
the contents of kamala1.txt and kamala2.txt are concatenated, and that output is redirected into a new text file called kamalacombo.txt
Mendel's laws apply:
to diploid organisms (loci), with dominant/recessive relationships and a simple gene-phenotype correspondence, when the 2 genes are not close together
parallel computing
used as a solution to BigData. Two options: HPC vs. cloud. Drexel uses Picotte.
cloud computing
way of organizing computing resources so that users can conveniently rent them instead of buying them
degeneracy
when a codon position could be changed in one or more ways without changing the encoded amino acid
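To make degeneracy concrete, the Python sketch below counts, for a given codon position, how many of the three alternative bases leave the amino acid unchanged. It uses the standard genetic code; the function name `synonymous_changes` is an assumption for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def synonymous_changes(codon, position):
    """Count the alternative bases at position (0, 1 or 2) that keep the amino acid."""
    codon = codon.upper()
    original_aa = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        if base == codon[position]:
            continue
        mutant = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[mutant] == original_aa:
            count += 1
    return count


print(synonymous_changes("GCT", 2))  # 3: any third base still encodes Ala
print(synonymous_changes("ATG", 2))  # 0: Met has a single codon
print(synonymous_changes("TTA", 0))  # 1: CTA also encodes Leu
```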
Describe the following terms and how they are relevant when performing BLAST searches: words, word size, thresholds, and extension in 2 directions.
words: short sequences from the query sequence that are used as seeds to initiate the search for similar regions in the database sequences. word size: parameter that specifies the length of the words used as seeds. Adjusting the word size affects the sensitivity and the speed of the search: a smaller word size increases sensitivity but decreases speed. blastn and blastp set default word sizes of 11 and 3, respectively. thresholds: scores that a word must reach or exceed to be considered a match and used as a seed for further searching. The threshold score is calculated based on a substitution matrix. extension in 2 directions: after a word meets the threshold score, it is considered a match. BLAST then extends this match in both directions (hence the name) to find a longer alignment. The extension continues as long as the alignment score keeps increasing and stops once the score drops too far below the best score reached. This process is critical for identifying longer regions of similarity that are not apparent from the initial word match.
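The seed-and-extend idea can be sketched in a few lines of Python. This is a heavily simplified, nucleotide-style illustration, not BLAST itself: seeds are exact word matches only, extension is ungapped, and the scoring values, the `x_drop` cutoff, and the function names are all assumptions made for the example.

```python
def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) for every exact word shared by both sequences."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend(query, subject, q_pos, s_pos, word_size,
           match=1, mismatch=-2, x_drop=3):
    """Extend a seed left and right without gaps.

    Each direction stops when the running score falls more than x_drop
    below the best score seen in that direction.
    """
    def one_way(q, s, step):
        best = score = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            q += step
            s += step
        return best, best_len

    right_score, right_len = one_way(q_pos + word_size, s_pos + word_size, 1)
    left_score, left_len = one_way(q_pos - 1, s_pos - 1, -1)
    total = word_size * match + left_score + right_score
    q_aln = query[q_pos - left_len:q_pos + word_size + right_len]
    s_aln = subject[s_pos - left_len:s_pos + word_size + right_len]
    return total, q_aln, s_aln


# Hypothetical sequences sharing the region GCTTGACC
query = "AAGCTTGACCGT"
subject = "CCGCTTGACCAA"

seeds = find_seeds(query, subject, word_size=4)
print(seeds)
print(extend(query, subject, 4, 4, word_size=4))  # (8, 'GCTTGACC', 'GCTTGACC')
```

A 4-base seed in the middle of the shared region grows into the full 8-base match, which is the kind of longer similarity a single word hit does not reveal on its own. With a larger `word_size` fewer seeds are found, which is faster but less sensitive.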