DADA2 ITS Package Tutorial
OTU (operational taxonomic unit)
A neutral way to refer to samples in a phylogeny; an operational definition used to classify groups of closely related individuals. Can also be defined as a cluster of marker-gene sequences (for example 16S rRNA) whose sequence divergence falls below a chosen threshold, commonly 3% (97% similarity).
adonis2()
Analysis of variance using distance matrices - for partitioning distance matrices among sources of variation and fitting linear models (e.g., factors, polynomial regression) to distance matrices; uses a permutation test with pseudo-F ratios.
cast()
Cast functions - cast a molten data frame into an array or data frame.
dir.create()
Creates a directory at the given path; a relative path is resolved against the current working directory
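A minimal sketch of dir.create() in base R. The folder name "filtered_example" is hypothetical, and the example writes under R's temporary directory rather than a real project folder.

```r
# Hypothetical output folder inside R's per-session temporary directory.
out_dir <- file.path(tempdir(), "filtered_example")

# showWarnings = FALSE keeps the call quiet if the folder already exists;
# recursive = TRUE also creates any missing parent folders.
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

dir.exists(out_dir)
```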
getUniques()
Get the uniques-vector from the input object - this function extracts the uniques-vector from several different data objects, including dada-class and derep-class objects, as well as data.frame objects that have both sequences and abundance columns. The return value is an integer vector named by sequence and valued by abundance. If the input is already in uniques-vector format, the same vector will be returned.
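To make the "uniques-vector" format concrete, here is a base-R sketch that builds one from a toy character vector of reads. It does not call dada2; the reads and variable names are made up for illustration.

```r
# Toy reads (hypothetical); real input would come from fastq files.
reads <- c("ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC")

# Count how often each distinct sequence occurs.
counts <- table(reads)

# A uniques-vector: integer abundances, named by sequence.
uniques <- setNames(as.integer(counts), names(counts))
uniques <- sort(uniques, decreasing = TRUE)

uniques         # abundances named by sequence
names(uniques)  # just the sequences, which is what getSequences() reports
```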
getSequences()
Get vector of sequences from input object - This function extracts the sequences from several different data objects, including dada-class and derep-class objects, as well as data frame objects that have both sequences and abundance columns. This function wraps the getUniques function, but returns only the names (i.e. sequences). Can also be provided the file path to a fasta or fastq file, a taxonomy table, or a DNAStringSet object. Sequences are coerced to upper-case characters.
ITS region
(Internal transcribed spacer) The most widely sequenced DNA region in the molecular ecology of fungi; it has been recommended as the universal fungal barcode sequence
rarefy_even_depth
Resample an OTU table such that all samples have the same library size
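The idea of resampling to an even depth can be sketched in base R. This is an illustration of the concept, not phyloseq's implementation: the table is invented, and the sketch assumes subsampling without replacement down to the smallest library size.

```r
set.seed(42)  # resampling is random, so fix the seed for repeatability

# Toy OTU table (hypothetical): taxa in rows, samples in columns.
otu <- matrix(c(10, 0, 5, 20, 30, 10), nrow = 3,
              dimnames = list(c("taxonA", "taxonB", "taxonC"),
                              c("sample1", "sample2")))

depth <- min(colSums(otu))  # rarefy every sample to the smallest library

rarefy_column <- function(counts, depth) {
  # Expand counts into one entry per read, labelled by taxon index.
  pool <- rep(seq_along(counts), times = counts)
  picked <- pool[sample.int(length(pool), size = depth)]
  # Count how many reads of each taxon were kept.
  tabulate(picked, nbins = length(counts))
}

rarefied <- apply(otu, 2, rarefy_column, depth = depth)
rownames(rarefied) <- rownames(otu)

colSums(rarefied)  # every sample now has the same library size
```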
colnames()
Row and column names - get or set the row or column names of a matrix-like object
labs()
Modify axis, legend, and plot labels
estimate_richness
Performs a number of standard alpha diversity estimates, and returns the results as a data.frame. Alpha diversity: diversity on a local scale, describing the species diversity (richness) within a functional community
scale_colour_brewer
Sequential, diverging and qualitative colour scales from colorbrewer.org
tax_glom()
This method merges species that have the same taxonomy at a certain taxonomic rank. Its approach is analogous to tip_glom, but uses categorical data instead of a tree. In principle, other categorical data known for all taxa could also be used in place of taxonomy, but for the moment, this must be stored in the taxonomyTable of the data. Also, columns/ranks to the right of the rank chosen to use for agglomeration will be replaced with NA, because they should be meaningless following agglomeration.
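Agglomerating counts at a taxonomic rank amounts to summing rows that share a label. The base-R sketch below shows that idea with rowsum(); it is not how phyloseq implements tax_glom, and the ASV names, genera, and counts are invented.

```r
# Toy count table (hypothetical): ASVs in rows, samples in columns.
counts <- matrix(c(5, 3, 2, 1, 4, 6), nrow = 3,
                 dimnames = list(c("asv1", "asv2", "asv3"),
                                 c("sample1", "sample2")))

# Hypothetical genus assignment for each ASV.
genus <- c(asv1 = "Fusarium", asv2 = "Fusarium", asv3 = "Mortierella")

# Sum the rows that belong to the same genus.
by_genus <- rowsum(counts, group = genus[rownames(counts)])
by_genus
```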
facet_wrap()
Wrap a 1d ribbon of panels into 2d
Illumina paired-end sequencing fastq files
Files that hold the results of sequencing both ends of a fragment. This allows generation of high-quality, alignable sequence data. Paired-end sequencing facilitates detection of genomic rearrangements and repetitive sequence elements, as well as gene fusions and novel transcripts
Amplicon sequence variant (ASV) table
a higher-resolution analogue of the traditional OTU table, which records the number of times each exact amplicon sequence variant was observed in each sample
Anova()
a statistical test for estimating how a quantitative dependent variable changes according to the levels of one or more categorical independent variables.
IdTaxa
An algorithm that can quickly and accurately classify nucleotide or amino acid sequences into a taxonomy of organisms or functions.
sapply
Apply a function over a list-like or vector-like object - lapply returns a list of the same length as x, each element of which is the result of applying FUN to the corresponding element of x. sapply is a user-friendly version and wrapper of lapply that, by default, simplifies the result to a vector or matrix where possible
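A short base-R comparison of lapply and sapply on the same input. The sequences are placeholders.

```r
seqs <- list(a = "ACGT", b = "ACGTAC", c = "AC")

# lapply always returns a list, one element per input element.
as_list <- lapply(seqs, nchar)

# sapply simplifies the same result to a named integer vector.
as_vector <- sapply(seqs, nchar)

str(as_list)
as_vector
```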
Amplicons (PCR product)
are DNA products of a polymerase chain reaction
rowSums()
Form row and column sums and means - rowSums calculates the total for each row of a matrix
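As a sketch, assuming a sequence table with samples in rows and ASVs in columns (the values are invented), rowSums() gives reads per sample and colSums() gives reads per ASV.

```r
# Toy sequence table (hypothetical counts).
seqtab <- matrix(c(12, 0, 7,
                    3, 5, 9),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("sample1", "sample2"),
                                 c("ASV1", "ASV2", "ASV3")))

reads_per_sample <- rowSums(seqtab)  # total for each row
reads_per_asv    <- colSums(seqtab)  # total for each column

reads_per_sample
reads_per_asv
```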
rbind
Combine objects by rows or columns - rbind and cbind take one or more objects and combine them by rows or columns, respectively
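The difference between the two is easiest to see on vectors of length three. The values are arbitrary.

```r
a <- c(x = 1, y = 2, z = 3)
b <- c(x = 4, y = 5, z = 6)

by_rows <- rbind(a, b)  # each vector becomes a row: 2 rows, 3 columns
by_cols <- cbind(a, b)  # each vector becomes a column: 3 rows, 2 columns

dim(by_rows)
dim(by_cols)
```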
vegdist
computes dissimilarity indices that are useful for or popular with community ecologists
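One of the indices vegdist offers, and its default method, is Bray-Curtis dissimilarity. The base-R sketch below computes it by hand for two invented samples; it does not call vegan.

```r
# Counts of three taxa in two hypothetical samples.
x <- c(10, 0, 5)
y <- c(20, 30, 10)

# Bray-Curtis: summed absolute differences over summed totals.
bray_curtis <- sum(abs(x - y)) / sum(x + y)
bray_curtis
```

A value of 0 means the two samples have identical counts; a value of 1 means they share no taxa.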
do.call()
constructs and executes a function call from a name or a function and a list of arguments to be passed to it
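A common use is binding a list of per-sample results into one table. In this sketch the sample names and read counts are invented.

```r
# One named vector per sample (hypothetical read-tracking numbers).
per_sample <- list(
  sample1 = c(input = 100, filtered = 90),
  sample2 = c(input = 120, filtered = 95)
)

# Equivalent to rbind(per_sample$sample1, per_sample$sample2),
# but works for a list of any length.
track <- do.call(rbind, per_sample)
track
```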
as.character()
Converts an object (for example, a numeric vector) to character type. When given a vector, every element is converted to a character string.
as.data.frame()
converts objects into a data frame
matrix
creates a matrix from the given set of values.
data.frame( )
creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.
maxN
(Default 0) After truncation, sequences with more than maxN Ns will be discarded. Note that dada does not allow Ns.
aes()
often used within other graphing elements to specify the desired aesthetics
saveRDS and readRDS
provide the means to save a single R object to a connection (typically a file) and to restore the object
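A round-trip sketch: save one object, read it back, and confirm nothing changed. The matrix stands in for a real sequence table and the file lives in a temporary location.

```r
# Stand-in object (hypothetical values).
seqtab <- matrix(1:4, nrow = 2,
                 dimnames = list(c("s1", "s2"), c("ACGT", "TTGA")))

path <- tempfile(fileext = ".rds")

saveRDS(seqtab, path)      # write the single object to disk
restored <- readRDS(path)  # read it back under any name you like

identical(seqtab, restored)
```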
rc(sq)
Reverse complement DNA sequences - this function reverse complements the DNA sequences provided. It is nothing more than a concisely named convenience wrapper for reverseComplement that handles the character vector DNA sequences generated in the dada2 package
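What reverse complementing does can be shown with a small base-R function. This is a sketch, not the dada2 or Biostrings implementation: rev_comp is a hypothetical helper, and it assumes the input contains only A, C, G, and T (no IUPAC ambiguity codes).

```r
# Hypothetical helper; handles plain A/C/G/T only.
rev_comp <- function(seqs) {
  # Swap each base for its complement.
  complemented <- chartr("ACGTacgt", "TGCAtgca", seqs)
  # Reverse the order of the characters in each sequence.
  vapply(strsplit(complemented, ""),
         function(chars) paste(rev(chars), collapse = ""),
         character(1))
}

rev_comp(c("ACCTGCGG", "AAAC"))
```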
seq_along(along.with)
sequence generation - generate regular sequences.
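A short sketch of why seq_along is the usual way to index over a vector. The file names are placeholders.

```r
files <- c("a_R1.fastq", "b_R1.fastq")

# One index per element.
idx <- seq_along(files)

for (i in idx) {
  message("file ", i, ": ", files[i])
}

# For an empty vector seq_along gives zero indices, so a loop is skipped,
# whereas 1:length(x) would give c(1, 0).
empty_idx <- seq_along(character(0))
```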
reverseComplement
Sequence reversing and complementing - use these functions for reversing sequences and/or complementing DNA or RNA sequences
Demultiplexed
split into individual per sample fastq files
vcountPattern
string searching functions - A set of functions for finding all the occurrences (aka "matches" or "hits") of a given pattern (typically short) in a (typically long) reference sequence or set of reference sequences (aka the subject)
merge_phyloseq
takes a comma-separated list of phyloseq objects as arguments, and returns the most-comprehensive single phyloseq object possible.
strsplit()
Split the Elements of a Character Vector - Split the elements of a character vector x into substrings according to the matches to substring split within them
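A typical use in amplicon workflows is pulling sample names out of fastq file names. The paths below are hypothetical and assume the sample name is everything before the first underscore.

```r
# Hypothetical forward-read file paths.
fnFs <- c("/data/run1/soilA_R1_001.fastq.gz",
          "/data/run1/soilB_R1_001.fastq.gz")

# Split each file name on "_" and keep the first piece.
pieces <- strsplit(basename(fnFs), "_")
sample_names <- sapply(pieces, `[`, 1)

sample_names
```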
ddply()
Split data frame, apply function, and return results in a data frame.
cbind
Combine objects by rows or columns - rbind and cbind take one or more objects and combine them by rows or columns, respectively
cutadapt
finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequences from high-throughput sequencing reads
ITS1 and ITS2
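As primer names, the forward and reverse primers, respectively. The same names also denote the two spacer sub-regions of the ITS, which lie on either side of the 5.8S rRNA gene.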
Chimerism
in genetics, the presence of cells of different origin in an individual, whether by mutation, transplant, or some other process; named from the chimera, a hybrid monster depicted as an amalgam of a lion, goat, and serpent
theme()
is used to control non-data parts of the graph
lm()
is used to fit linear models
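A minimal sketch of fitting and inspecting a linear model. The data frame is invented and lies exactly on the line richness = 1 + 2 * depth, so the fitted coefficients are easy to check.

```r
# Hypothetical data with an exact linear relationship.
d <- data.frame(depth = c(1, 2, 3, 4),
                richness = c(3, 5, 7, 9))

# Response on the left of ~, predictor(s) on the right.
fit <- lm(richness ~ depth, data = d)

coef(fit)  # intercept and slope
```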
DADA2
package infers exact amplicon sequence variants from high-throughput amplicon sequencing data, replacing the coarser and less accurate OTU clustering approach. The end product is an ASV table, like "seqtab.nochim_ITS.rds", and assignment of taxonomy to the output sequences, like "taxid_ITS.rds"
Rcpp
package that provides R functions as well as C++ classes which offer a seamless integration of R and C++
Requirements before beginning dada2
- Samples must be demultiplexed - Non-biological nucleotides have been removed, e.g. primers, adapters, linkers, etc. - If paired-end sequencing data, the forward and reverse fastq files contain reads in matched order
geom_bar()
Bar charts
ShortRead
Bioconductor package, and the name of its core class for short reads. The class provides a way to store and manipulate, in a coordinated fashion, uniform-length short reads and their identifiers.
assignTaxonomy
Classifies sequences against reference training dataset
aggregate()
Compute Summary Statistics of Data Subsets
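A sketch of aggregate() with the formula interface, computing a mean per group. The site names and richness values are invented.

```r
# Hypothetical per-sample richness with a grouping column.
df <- data.frame(site = c("forest", "forest", "field", "field"),
                 richness = c(10, 14, 6, 8))

# Mean richness within each site.
means <- aggregate(richness ~ site, data = df, FUN = mean)
means
```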
DNAString()
DNAString objects - A DNAString object allows efficient storage and manipulation of a long DNA sequence
dim(x)
Dimension of an object - retrieve or set the dimension of an object
dada()
High resolution sample inference from amplicon data - The dada function takes as input dereplicated amplicon sequencing reads and returns the inferred composition of the sample (or samples). Put another way, dada removes all sequencing errors to reveal the members of the sequenced community.
psmelt()
Melt phyloseq data object into large data.frame
derepFastq()
Read in and dereplicate a fastq file - A custom interface to FastqStreamer for dereplicating amplicon sequences from fastq or compressed fastq files, while also controlling peak memory requirement to support large files
removeBimeraDenovo()
Remove bimeras from collections of unique sequences - This function is a convenience interface for chimera removal. Two methods to identify chimeras are supported: identification from pooled sequences and identification by consensus across samples. Sequence variants identified as bimeric are removed, and a bimera-free collection of unique sequences is returned.
transform_sample_counts()
This function transforms the sample counts of a taxa abundance matrix according to a user-provided function. The counts of each sample will be transformed individually. No sample-sample interaction/comparison is possible by this method.
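A frequent choice of user-provided function is conversion to relative abundance, function(x) x / sum(x). The base-R sketch below applies that function to each sample column of an invented table; it illustrates the per-sample behaviour and does not use phyloseq.

```r
# Toy count table (hypothetical): taxa in rows, samples in columns.
counts <- matrix(c(10, 30, 60, 5, 5, 10), nrow = 3,
                 dimnames = list(c("taxonA", "taxonB", "taxonC"),
                                 c("sample1", "sample2")))

# The transformation sees one sample's counts at a time.
to_relative <- function(x) x / sum(x)

rel <- apply(counts, 2, to_relative)

rel
colSums(rel)  # each sample now sums to 1
```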
DECIPHER
Tools for curating, analyzing, and manipulating biological sequences - a software toolset for deciphering and managing biological sequences efficiently using the R statistical programming language. It is designed for non-destructive workflows for importing, maintaining, analyzing, manipulating, and exporting massive amounts of sequences
filterAndTrim
Filters and trims an input fastq file (can be compressed) based on several user-definable criteria, and outputs fastq files (compressed by default) containing those trimmed reads which passed the filters. Corresponding forward and reverse fastq files can be provided as input, in which case filtering is performed on the forward and reverse reads independently, and both reads must pass for the read pair to be output.