MICB 405
What is the command to view the access rights of a file and directory?
$ ls -l
What is an assembly?
1) An assembly is a data structure that maps sequence reads to a putative reconstruction of the target sequence. Reads are grouped into contigs and contigs are grouped into scaffolds. 2)Contigs provide multiple sequence alignments and the consensus sequence. Scaffolds define the orientation and order of contigs as well as the number of gaps between them. 3)Assemblies are measured by the size and accuracy of their contigs and scaffolds.
What are the 2 modules of BWA ?
1) BWA aln: for reads <= 75 nt. 2) BWA mem: for reads >= 75 nt. The two differ in seed length and in how gaps are handled. There is also BWA-SW, for reads much longer than 100 nt.
What are the 3 steps to analysis of sequence reads?
1) Base calling 2) Reference alignment/assembly 3) application specific analysis
What are the main two steps involved in building a phylogeny?
1) Build a multiple sequence alignment 2) Use the multiple sequence alignment to build a phylogeny
What are the caveats of progressive alignment methods?
1) CLUSTAL requires 20x the CPU time of other heuristic methods (MAFFT, MUSCLE) 2) The overall quality of an alignment depends on the quality of the initial alignment. A bad initial alignment propagates errors through the entire alignment. 3) It is difficult to manage insertions and deletions without extensive manual editing.
What does know your reagents mean in the context of similarity searching?
1) Changing your database choice changes your search space. 2) Database size affects BLAST statistics, so record a) the database choice, b) the database size and c) the BLAST parameters with every search. 3) Databases change rapidly and are frequently updated, so you may need to repeat the analyses.
How do we find out who infected whom?
1) Construct a phylogenetic tree to quickly identify clusters of related isolates 2) SNP-by-SNP examination or transmission inference methods - to track person-to-person or exposure-to-person events
What are the 3 steps involved in using samtools?
1) Convert a SAM file to a BAM file: $ samtools view -b file1.sam > file1.bam
2) Sort the BAM file: $ samtools sort -o file1.sorted.bam file1.bam
3) Index the sorted BAM file: $ samtools index file1.sorted.bam (a .bai index file will be created)
What are the limitations of whole genome sequencing?
1) Cost: $75 to $250 per test. 2) It is difficult to validate genomic results. 3) There is usually no consistent performance across samples, since the sequencing platforms, program tools and databases are constantly changing. 4) Medical tests must be accredited. 5) It is difficult to communicate genomic information to a busy medical doctor.
How is random sampling done in bootstrapping?
1) Create multiple pseudo-alignments by randomly resampling columns, with replacement, from the original alignment. 2) Build a pseudo-tree for each pseudo-alignment. 3) Count the number of times a branch in your tree occurs in the n pseudo-trees. 4) If the branch occurs in >80% of the pseudo-trees it is well supported by the data. If it occurs in <50% it is not supported by the data; short distances can be the cause of this low support.
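Step 1 of the bootstrap can be sketched in a few lines of shell and awk. This is only an illustration, assuming a toy alignment stored as one equal-length sequence per line; the function name pseudo_alignment and the seed argument are made up for the example, and real tree-building programs do this resampling internally.

```shell
# pseudo_alignment SEED FILE -> one bootstrap replicate of an alignment
# (hypothetical helper; FILE holds one aligned sequence per line, all the same length)
pseudo_alignment() {
    awk -v seed="$1" '
        { row[NR] = $0 }
        END {
            srand(seed)
            ncol = length(row[1])
            # Draw ncol column indices at random, with replacement.
            for (j = 1; j <= ncol; j++) pick[j] = int(rand() * ncol) + 1
            # Apply the same column draw to every sequence so columns stay intact.
            for (r = 1; r <= NR; r++) {
                out = ""
                for (j = 1; j <= ncol; j++) out = out substr(row[r], pick[j], 1)
                print out
            }
        }' "$2"
}

aln=$(mktemp)
printf 'ACGT\nACGA\nTCGA\n' > "$aln"    # toy alignment: 3 sequences, 4 columns
replicate=$(pseudo_alignment 42 "$aln")
echo "$replicate"
rm "$aln"
```

Each replicate has the same number of sequences and columns as the original, but some columns appear more than once and others not at all, which is what lets the pseudo-trees test how much the topology depends on a few positions.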
3 programming steps to perform the BWA alignment
1) Generate an index from your reference FASTA file: bwa index reference.fa
2) Align your reads to the generated index: bwa aln reference.fa myread1.fq > myread1aligned.sai and bwa aln reference.fa myread2.fq > myreadaligned2.sai
3) Generate alignments from paired-end reads in SAM format: bwa sampe reference.fa myread1aligned.sai myreadaligned2.sai myread1.fq myread2.fq > aln-pe.sam
Name the 3rd generation sequencing platforms
1) Helicos Heliscope 2) Pacific Biosciences 3) Oxford Nanopore technologies
What are the various parts of a computer?
1) Input: accepts information/codes from humans or other computers. 2) Memory: a) Primary memory storage: stores programs that are being executed, aka RAM b) Secondary memory storage: CDs, USBs, hard disk. 3) Processing unit: a) Arithmetic logic unit: executes arithmetic and logical operations b) Control unit: controls the order of operations; the "nerve center" of the computer; accesses, interprets and directs instructions. 4) Output: sends processed results to the outside world.
What are the limitations of clinical metagenomics?
1) It can be quite challenging since sample preparation and library construction requires time and skill. 2)Everything in the lab must be traceable sample, reagents and lots. 3)There are several factors that can still prevent detection of the pathogen- Phase of infection, Pathogen titres and sample qualities 4)If the organism is not in the reference database you won't get results. 5)It is difficult to distinguish between pathogen, commensal and contaminant since there are no well established threshold levels to follow. 6)Currently it is still a test of last resort
What are the steps of Clustal?
1) Matrix (pairwise distances) 2) Guide Tree (roadmap) 3) Align (alignment begins) 4) Iterate (alignment continues) 5) End (alignment ends)
OLC challenges
1) Repeats, errors, polymorphisms and other ambiguities result in forking paths that decrease contig length and increase the graph complexity. 2) OLC is suited to data sets of fewer than billions of reads, each longer than 300 bp. Each node is a read, so reads <300 bp result in complicated graphs that take a lot of time to analyse. 3) Computational complexity relates to both the number of pairwise comparisons and the number of edges in the graph. The computation is difficult to parallelize because the graph is held in RAM.
What are the goals of similarity searching?
1) Retrieve sequences from a database - search a big database efficiently 2) Identify true positives (i.e. orthologs) - infer function, transfer annotations, structure/domain information 3) Limit false positives (i.e. misidentification of non-orthologs) 4) Find the optimal alignment.
What are the 3 main stages in read alignments?
1) Sequence and quality strings (FASTQ file) 2) BWA aligner and reference genome 3) Read alignment and mapping qualities
What can MSAs be used for?
1) To build a phylogenetic tree. Identifying the conserved regions, mutations and indels can enable a scientist to reconstruct the evolutionary history of a sequence. 2) To Identify conserved regions in a sequence family. - Can indicate that these regions are important to the structure/function of the protein. -Can be used to design "universal" primers, identify potential vaccine antigens, enzyme active sites etc. 3) To Identify variable regions in a protein family -In bacteria and viruses, often may indicate some sort of immunogenic importance - can be used to strain type a bacteria.
Why would one construct a phylogenetic tree?
1) To identify two sequences that are the most closely related to each other. 2) To identify two sequences that arose from a common ancestor 3) To find out when two sequences diverged 4) To study the population structure of an organism of interest
What are the redirection commands and how are they used?
> redirects standard output to a file:
[mhirst@micb405 /]$ cat > list1 [enter] DNA Microbes Party [Ctrl + D] to stop
>> appends standard output to an existing file:
[mhirst@micb405 /]$ cat >> list1 [enter] adds what you type to the existing list
[mhirst@micb405 /]$ cat list1 list2 > biglist1 [enter] joins two lists into one file
< redirects standard input from a file:
[mhirst@micb405 /]$ sort < list1 [enter]
| pipes the output of one command into another:
[mhirst@micb405 /]$ cat list1 | sort [enter]
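The same operators can be tried without typing at the keyboard. The sketch below is a non-interactive version that uses printf in place of typed input; the file names follow the card, and the scratch directory from mktemp is an assumption made so that nothing in your working directory is overwritten.

```shell
workdir=$(mktemp -d)
list1="$workdir/list1"
list2="$workdir/list2"
biglist="$workdir/biglist1"

printf 'Microbes\nDNA\n' > "$list1"     # > creates or overwrites the file
printf 'Party\n' >> "$list1"            # >> appends to the existing file
printf 'Alignment\n' > "$list2"
cat "$list1" "$list2" > "$biglist"      # join two lists into one file

sorted_stdin=$(sort < "$list1")         # < feeds the file to sort on standard input
sorted_pipe=$(cat "$biglist" | sort)    # | sends the output of cat to sort
echo "$sorted_pipe"

rm -r "$workdir"
```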
What are hard links?
A direct connection between two entries in two different databases. Not all possible links are present they depend on the source information: 1) A link from a protein sequence to a 3D crystal structure 2) A link from a nucleotide sequence to a publication describing that nucleotide 3) A link from a nucleotide sequence to the protein CDS 4) A link from a protein sequence to a taxonomy database.
What is a Euler path?
An Eulerian path is one that crosses every edge exactly once; if it ends where it started it is an Euler cycle.
What is the format of a FASTQ file?
A fastq file is composed of the sequence and the corresponding base qualities: Each sequence has four lines: 1) The first line starts with '@' and is followed by the unique identifier and an optional description 2) The second line contains the sequence in A/G/T/C 3) The third line begins with '+' followed by the optional unique identifier (from line one) 4) The final line contains the base qualities in ASCII base 33 code for each of the bases in the corresponding line 2 sequence.
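Because every record is exactly four lines, simple line arithmetic is enough to pull a FASTQ file apart. The sketch below uses a made-up two-read file (read names, sequences and qualities are invented for illustration) and relies on line position rather than on the '@' character, since '@' can also appear at the start of a quality line.

```shell
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@read1 example description
ACGTACGT
+
IIIIHHHH
@read2
TTGCA
+read2
##5II
EOF

# Every record is 4 lines, so the read count is the line count divided by 4.
n_reads=$(awk 'END { print NR / 4 }' "$fastq")

# Line 2 of each record (NR % 4 == 2) is the sequence.
seqs=$(awk 'NR % 4 == 2' "$fastq")

# Count records whose quality string (line 4) differs in length from the sequence (line 2).
bad=$(awk 'NR % 4 == 2 { len = length($0) }
           NR % 4 == 0 && length($0) != len { n++ }
           END { print n + 0 }' "$fastq")

echo "reads=$n_reads mismatched_records=$bad"   # prints: reads=2 mismatched_records=0
rm "$fastq"
```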
File
A file is a collection of data. Files are created by users using a text editor, compiler etc.
Hamiltonian circuit
A hamiltonian circuit visits each node exactly once before returning to the start.
Compute time and resources
A heuristic may only marginally improve compute time with higher costs
What is a phylogenetic tree?
A phylogenetic tree is a 'family' tree that shows the relationship between sequences
What is a phylogenetic tree and what are its components?
A phylogenetic tree is a diagram of the inferred evolutionary relationship between a set of sequences that arose from a common ancestor. It is composed of nodes (branching points) and tips/leaves (the sequences being examined). It has branches whose lengths may, but do not always, represent evolutionary time/genetic distance.
Define a process
A process is an executing program identified by a unique process identifier (PID).
What is the basic workflow for illumina sequencing?
After the single-stranded DNA library is prepared and the A and B adaptors are attached to the ends of each strand, the DNA is applied to a flow cell coated with oligonucleotides complementary to the A and B adaptors. The fragments bind to these oligonucleotides and are clonally amplified by bridge amplification. Once bridge amplification is complete, the double-stranded molecules are denatured into single strands and sequencing by synthesis begins: the incorporation of each fluorescently labelled nucleotide is followed by an imaging step to determine the base identity. The 3'OH of each nucleotide is initially blocked to ensure that only one base is incorporated per cycle. After imaging, a chemical step removes the fluorescent group and unblocks the 3'OH of the incorporated base so that another base can be incorporated.
What's the difference between global and local alignments?
Alignments can be either global or local. A Global alignment is an optimal alignment that includes all characters from each sequence (generated by Clustal). A local alignment is an optimal alignment that includes only the most similar local region/regions (blast generates local alignments )
Define a directory
All files are grouped together in a directory structure. The file system is arranged in a hierarchical manner like an inverted tree. The top of the tree is traditionally called the root.
What are homologs?
All sequences in a tree should be homologs, they share a common ancestor
What is BLAST and how does it work?
BLAST is a program used to identify similar sequences: 1) There are precomputed BLAST results for all sequences in GenBank 2) Sequence similarity meets a statistical criteria(cutoff) 3) There is a different list of neighbors for nucleotide vs protein sequences. 4) Two sequences with a high degree of sequence similarity often have related biological functions
What is different about character based methods?
Character based methods of phylogeny perform better: 1) They take into account the characters and mutations. 2) They require an evolutionary model 3) They are slower 4) They result in more than one 'best' tree E.g. Maximum Parsimony and Maximum Likelihood
What are the limitations of sequencing and assembly software?
DNA sequencing technologies share the fundamental problem that read lengths are shorter than even the smallest genomes. Whole genome shotgun sequencing overcomes this problem by oversampling the target sequence with reads from random positions. Assembly software then reconstructs the target sequence. Assembly software is challenged by repeat sequences, non-uniform coverage, computational complexity and sequencing error.
What is phasing and pre-phasing?
Depending on the efficiency of the fluidics and sequencing reactions, a small number of molecules in each cluster may fall behind (phasing) or run ahead (prephasing) of the current incorporation cycle. This effect can be mitigated by applying corrections during base calling by using the statistical averages over many clusters and sequences to estimate the correlation of signals between different cycles.
How do you identify real matches?
Real and artifactual matches are discriminated by estimating how likely the match is to have occurred by chance, reported as the E (Expect) value. As the score (S) of a match increases, its E value decreases.
What is the distance based method and what are the pros/cons?
Distance based methods include Neighbour Joining. Neighbour Joining is a method of constructing a phylogenetic tree: 1) Calculate the distance matrix based on the pairwise genetic distances. 2) Cluster the two most closely related sequences. 3) Re-calculate the distance matrix with the two sequences fused, and re-cluster. Pros: quick, so suitable for bootstrapping; results in one tree ("deterministic"). Cons: does not take into account the actual characters, so an A to T mutation is considered the same as an A to G to C to T mutation; does not take into account an evolutionary model.
What's the best way to check the quality of an MSA?
Ensure the alignment of known sequences: 1) Secondary structure elements 2) Antigenic regions 3) Functional Motifs
What is Entrez?
Entrez is an integrated database retrieval system that provides access to a diverse set of 39 databases that together contain 1.7 billion records
How is file security denoted?
File security for each file/directory is denoted as a 10 character string in the leftmost column. The first character is d if the security permissions are for a directory and - if not. The next 3 characters are the permissions for the user who owns the file or directory, the middle group of 3 are the permissions for the group to which the file/directory belongs, and the last 3 are the permissions for everyone else. r = read a file / list the files in a directory; w = write to a file / create or delete files in a directory; x = execute a file / access files in a directory
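To see the 10 character string for yourself, set known permissions on a scratch file and read them back. This is a small sketch; the file name reads.fastq and the mode 640 are arbitrary choices for the example.

```shell
workdir=$(mktemp -d)
touch "$workdir/reads.fastq"          # hypothetical file name
chmod 640 "$workdir/reads.fastq"      # user: rw-, group: r--, everyone else: ---

# The first 10 characters of the ls -l line are the file type plus three rwx triplets.
perms=$(ls -l "$workdir/reads.fastq" | cut -c1-10)
echo "$perms"                         # prints: -rw-r-----

# A directory is marked with a leading d.
dir_type=$(ls -ld "$workdir" | cut -c1)
echo "$dir_type"                      # prints: d

rm -r "$workdir"
```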
What is GS FLX 454 sequencing?
GS FLX 454 sequencing is a technique that was developed by Roche diagnostics in 2004. The technique is based on the principle of 'pyrosequencing', in which the incorporation of a nucleotide releases a pyrophosphate molecule that, through a series of downstream reactions, produces light when luciferase oxidizes luciferin to oxyluciferin.
Why is Genomic Epidemiology better than molecular epidemiology?
Genomic Epidemiology is a higher resolution method. Molecular epidemiology also lacks consistency of results.
How do you ensure good computational science?
Good computational science requires the same critical analysis as at the bench. One needs to ask the following questions: 1)What assumptions am I making? 2) Is the experiment working correctly? 2a) Have I used the appropriate controls 2b) What are the statistical and other indicators given to show the significance of the program output. 3) Does the output match the expected results? 4) What conclusions can be drawn? 5) Do the conclusions from this experiment complement those from other experiments at the bench or using other programs?
Define Heuristics
Heuristics is finding a 'good enough' solution that works in a reasonable time frame (i.e. the clock time that an algorithm takes to complete a task). It involves trading optimality, accuracy, completeness or precision for compute time and resources.
How does assembly depend on coverage?
High depth coverage is necessary for high quality of assembly. Sequencing depth requirements depend on size of the genome, G+C content, sequencing platform and number of repeat sequences. Completion of the genome can be estimated by counting the number of single copy marker genes.
Completeness
If a read maps to 100 regions in a genome do we need to know all of them?
Optimality
If a read maps to multiple regions in a genome with 1 base mismatched, do we need to know which is correct?
What is base calling chastity?
It is a formula used to flag polyclonal clusters. Chastity = brightest intensity / (brightest intensity + second brightest intensity). A cluster passes if chastity >= 0.6, with 1 failure allowed in the first 25 bases.
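The chastity formula is easy to check by hand or with a short function. The sketch below assumes four positive intensity values for one cluster in one cycle; the function name chastity and the example intensities are invented for illustration and are not taken from any Illumina software.

```shell
# chastity I1 I2 I3 I4 -> brightest / (brightest + second brightest)
# (hypothetical helper; assumes positive intensities for the four channels)
chastity() {
    awk -v a="$1" -v c="$2" -v g="$3" -v t="$4" 'BEGIN {
        n = split(a " " c " " g " " t, v, " ")
        top = 0; second = 0
        for (i = 1; i <= n; i++) {
            x = v[i] + 0
            if (x > top) { second = top; top = x }
            else if (x > second) { second = x }
        }
        printf "%.2f\n", top / (top + second)
    }'
}

chastity 900 100 50 20    # 0.90: one clear signal, passes the 0.6 threshold
chastity 500 450 30 20    # 0.53: two competing signals, fails the threshold
```

A cluster grown from two different templates tends to give two strong signals in the same cycle, which pulls the ratio toward 0.5.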
MSAs
MSA programs try to find the best global alignment quickly, since optimizing the pairwise global alignment for all possible pairs in a large sequence pool is slow. The first implication of this is lower accuracy: while dynamic programming can guarantee an optimal alignment, MSA programs must take heuristic shortcuts. The second implication is manual review: since MSA programs generate only approximate alignments, these alignments must be reviewed by eye.
Mapping Qualities
Mapping qualities show how confident the aligner is with the read mapping. Mapping qualities are not the same as BLAST expect values. Mapping qualities measure the probability that a read is misplaced. Mapping qualities are derived from base qualities, and the frequency and number of mismatches from the best alignment vs all other possible alignments. Mapping Qualities are reported on a Phred Scale
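On the Phred scale a quality Q corresponds to an error probability P through Q = -10 x log10(P). The sketch below applies that standard relation to the probability that a read is misplaced; the function name mapq is made up for the example, and how an aligner estimates P in the first place is not shown.

```shell
# mapq P -> Phred-scaled quality for a misplacement probability P
# (hypothetical helper; awk's log() is the natural log, so divide by log(10))
mapq() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

mapq 0.001    # 30: a 1 in 1000 chance that the read is misplaced
mapq 0.1      # 10: a 1 in 10 chance that the read is misplaced
```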
What is MP and why or why not would you use it?
Maximum Parsimony is a method of generating a phylogenetic tree following the principle that the best tree is the one that requires the fewest evolutionary changes. The approach is to generate all the possible trees for a set of sequences and then score each tree according to the number of evolutionary changes required to make the tree. Advantages: it is better than the distance method and the fastest among the character based methods. Disadvantages: it doesn't take an evolutionary model into account, so the deeper the divergence the more inaccurate it is.
What is ML and why or why not would you use it?
The maximum likelihood method states that the best tree is the one that is most likely to give rise to the sequence data. Whereas MP counts the number of mutations and determines the least number of mutations required to explain a tree, ML considers the various possible ways in which those mutations can occur and determines which are most biologically feasible. Trees are generated for x sequences in a database. Heuristics are used to determine which tree shapes are more realistic. The sequences are then arranged over the tree shapes in various orders. The likelihood that each tree shape and sequence order combination gives rise to the multiple sequence alignment is calculated and the tree with the maximum likelihood is selected. This method is the most accurate phylogenetic method; however, it is very slow.
What is the point of MSA?
Multiple sequence alignment enables one to 1) Align 3 or more homologous sequences. The sequences must be from genes/proteins that have some sort of evolutionary relationship (i.e. it's alright to align Influenza HA proteins with each other, but it's not OK to align Influenza HA with NA proteins). 2) Align all homologous regions of all sequences analysed in the same column.
What are the 3 nucleotide sequence databases and what is the name of their collaboration?
NCBI - National Center for Biotechnology Information; DDBJ - DNA Data Bank of Japan; ENA - European Nucleotide Archive. Collaboration: INSDC - International Nucleotide Sequence Database Collaboration
What are neighbors and what are some examples?
Neighboring is a different method of making connections between entries in different databases. Neighboring is subjective since the definition of similarity is different for each database. E.g. Similar sequences, 3D structures, Publications
When do you use nucleic acids and when do you use protein sequences to construct a phylogeny?
Nucleic acids are best when: a) The sequences are closely related (same AA but different codons) b) You are trying to make a species phylogeny (use rRNA). Proteins are best when: a) The sequences are more divergent b) You are analyzing sequences from a range of bacteria: different bacterial genomes have different G+C content, and these G+C preferences may look like silent mutations.
How do you assess the reliability of a tree?
One method is to use bootstrapping. Bootstrapping is a statistical test that shows that resampling the data (shuffling the alignment up) doesn't alter the tree topology and that a few positions in alignment are not controlling the overall layout of the tree.
What are orthologs?
Orthologs are homologs that are produced by speciation and share a similar function
What is an OLC and how is it constructed?
Overlap Layout Consensus 1) Overlap between reads is used to create links between them, resulting in a directed graph based on all versus all alignment that completes a Hamiltonian circuit. This identifies reads that can be merged to create a contig consensus sequence. 2) The genome is assembled by aligning the sequences of adjacent clones and creating a path through these alignments that results in a non-redundant sequence. 3) Assemblers based on this method include Celera, Canu, VCAKE, SSAKE, SHARCGS and Newbler
Pairwise alignments vs MSA part 1
Pairwise alignments (alignment of two sequences against each other): To try and find the mathematically optimal alignment or to try to find the best fast alignment (e.g. with the use of a BLAST word based heuristic) Can be local or global Pairwise alignments are hence: 1) Computationally expensive especially if optimized 2) mathematically accurate
What are paralogs?
Paralogs are homologs within a species produced by gene duplication that have different functions.
What are progressive alignment strategies?
Progressive alignment strategies start by aligning two sequences and then add on other sequences iteratively, one by one.
Accuracy and precision
Provide a way to rank quality of alignment against other possible solutions
How many reads do we need to sequence a target genome?
Rn = (C x T) / (Pf x Lr), where Rn = number of reads needed to sequence a target gene/genome, C = sequence fold coverage, T = length of target sequence in base pairs, Pf = pass rate (the fraction of reads above a certain quality threshold), Lr = read length
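A worked example makes the formula concrete. The numbers below (30x coverage of a 5 Mb genome, an 80% pass rate and 150 bp reads) are assumptions chosen for illustration, and the function name reads_needed is made up.

```shell
# reads_needed C T Pf Lr -> Rn = (C x T) / (Pf x Lr), rounded to the nearest read
# (hypothetical helper)
reads_needed() {
    awk -v c="$1" -v t="$2" -v pf="$3" -v lr="$4" \
        'BEGIN { printf "%.0f\n", (c * t) / (pf * lr) }'
}

reads_needed 30 5000000 0.8 150    # 150,000,000 / 120 = 1250000 reads
```

Lowering the pass rate or the read length raises the number of reads required for the same coverage.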
Samtools
Samtools is a set of utilities that manipulates files in the BAM format. It imports from and exports to the SAM format, does sorting, indexing and merging and allows the swift retrieval of reads from any region. Samtools is designed to work on a stream. It regards input files as standard in (stdin) and output files as standard out (stdout).Several commands can thus be combined using UNIX pipes. Samtools always outputs error messages and warnings to standard error (stderr). Samtools searches the current working directory for the index file and downloads the index file upon its absence. Samtools does not retrieve the entire alignment file unless asked to do so.
What is the difference between soft-clipped and hard clipped bases?
Soft-clipped bases are those segments of the sequence where the 5' to 3' sequence has not been included in the alignment but it is still part of the read sequence in the bam file. Hard clipped bases are bases that have not been included in the alignment and have also been removed from the bam file. So the real sequence length in the latter case would be = SEQ + count of hard-clipped bases
What are the main steps in Illumina Hiseq base calling?
Terabytes of images --> intensity files --> 3.6 Terabytes of base calls and qualities (qseq files)
What is the basic workflow for GS FLX 454 sequencing?
The GS FLX 454 involves emulsion PCR. The library fragments produced by shearing, ligation and adaptor addition are amplified on beads carrying sequences complementary to one of the adaptor sequences on each fragment. The emulsion mixture of oil and water creates microreactors in which each bead is isolated in its own micelle, and the fragment attached to the bead through its complementary adaptor sequence is amplified. At the end of PCR each bead carries multiple copies of one fragment and is applied to a PicoTiterPlate with enzyme beads. The solution applied to the plate cycles through each of the 4 bases, and with the addition of each base the machine measures its incorporation via the release of a PPi molecule and the subsequent oxidation of luciferin to oxyluciferin, which releases light.
What is a kernel?
The kernel is the hub of the UNIX OS. It allocates time and memory to programs, handles the filestore (files and directories), and handles communications in response to system calls. E.g. when you type rm myfile, the shell searches the filestore for the file containing the program rm and then requests the kernel, through system calls, to run rm on myfile. Once the process is complete the shell returns the UNIX prompt $ to the user to indicate that it awaits further commands.
What is the ls command and what does it do?
The ls command lists the contents of your current working directory: [mhirst@micb405 /]$ ls [enter]
What is the shell?
The shell is an interface between the user and the kernel. When you first log in, a log-in program checks your username and password and starts another program called the shell. The shell is a command line interpreter. It interprets the commands you type and arranges for them to be carried out. The commands themselves are programs and when they finish, the shell gives you another prompt.
How does one increase the speed of an alignment?
To increase the speed of an alignment one needs to convert the genome and/or reads into an indexed table of short words.
Why do traditional diagnostic tests fail?
Traditional diagnostic tests cannot test for everything all of the time. For certain conditions there is a high amount of negative test results. Traditional diagnostic tests cannot identify new strains of bacteria i.e. there are some cases when the pcr primers no longer work. Traditional methods like culturing and serology can only reveal the presence or absence of a pathogen. We need additional tests to determine drug resistance, virulence factor, epidemiology etc. New genomic methods can provide all this information.
How do we identify a mystery pathogen?
Use clinical metagenomics. Obtain a sample from the patient for sequencing (some samples contain more human DNA than others). Once the sample is sequenced, use the two-step binning approach. In the first step the reads are sorted into higher level bins (human, virus, bacteria etc.) using the k-mer method. In the second step the reads are sorted into species-level bins by pairwise alignment between reads in the same bin. Once the species-level reads are obtained, map these reads to a reference database to identify leads for the pathogen. Tools that can be used include Codex, Kraken, SURPI and Taxonomer.
What is the probability that a base is not sequenced?
Use the Lander-Waterman model, which is based on a Poisson distribution and assumes that reads are sampled from random positions in the genome. P0 = e^(-C), where P0 = probability that a base is not sequenced, e = Euler's number (about 2.718), C = coverage = LN/G, L = read length, N = number of reads, G = length of target sequence in base pairs
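As a quick check of the model, the sketch below computes the coverage and the probability that a base is missed for one made-up experiment (500,000 reads of 100 bp against a 5 Mb target). The function name p_unsequenced and the input values are assumptions for illustration.

```shell
# p_unsequenced L N G -> coverage C = LN/G and P0 = e^(-C)
# (hypothetical helper)
p_unsequenced() {
    awk -v l="$1" -v n="$2" -v g="$3" 'BEGIN {
        c = l * n / g
        printf "coverage=%.1f P0=%.2e\n", c, exp(-c)
    }'
}

p_unsequenced 100 500000 5000000    # prints: coverage=10.0 P0=4.54e-05
```

At 10x coverage roughly 45 bases per million are expected to be missed, so a 5 Mb genome would still have on the order of 200 unsequenced bases.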
How do you find out which directory you're in?
Use the print working directory command: [mhirst@micb405 /]$ pwd To go home: [mhirst@micb405 /]$ cd ~
When do you get a mapping quality of zero?
When a read aligns to more than one region in the reference genome.
How can whole genome sequencing replace phenotype drug resistance testing?
Whole genome sequencing data can be aligned to a reference database of mutations and resistance genes to determine if the bacteria/virus is drug resistant. Of course, this only works if the reference database has the drug resistance gene/mutation recorded. It cannot be used to predict minimum inhibitory concentrations.
What are xenologs?
Xenologs are homologs produced by lateral gene transfer between species
How do you copy one file to another file?
[mhirst@micb405 /] $ cp file1 file2 [enter]
How do you display the contents of a file on the screen?
[mhirst@micb405 /]$ cat file1 [enter]
How do you change a directory?
[mhirst@micb405 /]$ cd genomes [enter] Changes current working directory to genomes [mhirst@micb405 /]$ cd .. changes working directory to parent of the current working directory [mhirst@micb405 /]$ cd . changes directory to current directory.
What is the command to run FastQC on a file in the home directory and direct the output to the home directory?
[mhirst@micb405 /]$ fastqc /home/mhirst/file.fastq -o /home/mhirst/ & [enter]
What is the second command used to search a text file for a sequence?
[mhirst@micb405 /]$ grep AAATAC assemble.fa [enter] Options -v display lines that do not match -n display each matching line with the line number -c print only the total count of matched lines -i ignore case
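The four options are easiest to compare on a small file. The sketch below builds a made-up three-contig FASTA file (the contig names and sequences are invented) and runs each option against it.

```shell
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>contig1
GGAAATACCT
>contig2
ccaaatacgg
>contig3
TTTTGGGG
EOF

match_count=$(grep -c AAATAC "$fasta")        # -c: count matching lines (1)
nocase_count=$(grep -c -i AAATAC "$fasta")    # -i: ignore case, so the lower-case contig matches too (2)
numbered=$(grep -n AAATAC "$fasta")           # -n: prefix each match with its line number (2:GGAAATACCT)
non_matching=$(grep -v -c AAATAC "$fasta")    # -v: select lines that do NOT match (5)

echo "$match_count $nocase_count $numbered $non_matching"
rm "$fasta"
```

Note that grep works line by line, so a pattern that spans a line break in a wrapped FASTA sequence will not be found.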
How do you display the first ten lines of a file?
[mhirst@micb405 /]$ head file1[enter]
How do you display the contents of a file one screen at a time?
[mhirst@micb405 /]$ less file1 [enter]
What is the first command used to search a text file for a sequence?
[mhirst@micb405 /]$ less file1.fa [enter] then, inside less, type /AATACT [enter] to search for the pattern; type [n] to jump to the next occurrence of the pattern
How can you modify the ls command?
[mhirst@micb405 /]$ ls -a can list all files including hidden files that start with a '.'
How do you make and see a directory?
[mhirst@micb405 /]$ mkdir genomics [enter] [mhirst@micb405 /]$ ls [enter]
What is the difference between the mv file and the copy file commands?
[mhirst@micb405 /]$ mv file1 file2 [enter] moves (renames) file1 to file2, so you end up with only one file containing the data. cp file1 file2 leaves the original in place, so you end up with two copies.
How do you delete a directory
[mhirst@micb405 /]$ rmdir directory1 [enter]
How do you delete a file?
[mhirst@micb405 /]$ rm file1 [enter]
What does the command tail do?
[mhirst@micb405 /]$ tail file1 [enter] Displays the last ten lines of a file
How do you calculate the number of words in a file? How do you calculate the number of lines in a file
[mhirst@micb405 /]$ wc file1 [enter] prints the number of lines, words and bytes in the file. Options: -w count only the words, -l count only the lines
ASCII number
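Encoding each quality score as a single ASCII character saves space. Q = ASCII decimal value - 64 (the Phred+64 encoding of older Illumina pipelines; current FASTQ files use an offset of 33), e.g. for 'h': 104 - 64 = Q40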
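The conversion from a quality character to a Q score can be tried directly in the shell. The sketch below assumes a POSIX printf, which prints the numeric code of a character when the argument starts with a quote; the function names phred64 and phred33 are made up, one for each of the two offsets mentioned in these notes.

```shell
# phred64 CHAR -> quality score under the offset-64 encoding (hypothetical helper)
phred64() {
    ascii=$(printf '%d' "'$1")    # a leading quote makes printf print the character code
    echo $((ascii - 64))
}

# phred33 CHAR -> quality score under the offset-33 encoding (hypothetical helper)
phred33() {
    ascii=$(printf '%d' "'$1")
    echo $((ascii - 33))
}

phred64 h    # 104 - 64 = 40
phred33 I    # 73 - 33 = 40
```

The same Q40 is written as h with an offset of 64 and as I with an offset of 33, so the offset must be known before quality strings are interpreted.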
What method should be used to find out more information about a pathogen after it has been identified?
Whole genome sequencing should be used. It is easiest to sequence from a pure culture, although you can do it using a sample as well. The choice of sequencing platform primarily depends on the desired throughput. Reference assembly of the genome may be performed for simpler organisms, but for more complex organisms de novo assembly is better.
