micca-otu-denovo

Performs OTU clustering and taxonomy assignment. Reads are clustered de novo, that is, without any external reference. The output directory will contain the following files:

clusters.txt

a tab-delimited file where each row lists the sequence identifiers assigned to one cluster. The first identifier on each row is the cluster's representative sequence. Sequence identifiers are coded as SAMPLE_NAME||SEQ_ID:

sample1||F4HTPAO07H4B1Q sample1||F4HTPAO07ILHKH sample1||F4HTPAO07H8VJE  ...
sample3||F4HTPAO05FO0LC sample2||F4HTPAO02BVI74 sample3||F4HTPAO05FQCOF ...
...
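
As a minimal sketch of how this layout can be consumed (it is not part of micca itself), the Python snippet below reads rows in the clusters.txt format and counts how many reads each sample contributes to a cluster. The function names parse_clusters and sample_counts are hypothetical, and the snippet assumes that the representative id is itself a member of its cluster. The two rows are the truncated examples shown above, so the counts cover only the ids listed here.

```python
from collections import Counter


def parse_clusters(lines):
    """Map each representative id to the list of ids on its row."""
    clusters = {}
    for line in lines:
        # Rows are tab-delimited; drop empty fields left by trailing tabs.
        ids = [field for field in line.rstrip("\n").split("\t") if field]
        if not ids:
            continue
        # Assumption: the first id is the representative and also a member.
        clusters[ids[0]] = ids
    return clusters


def sample_counts(ids):
    """Count reads per sample, using the SAMPLE_NAME||SEQ_ID convention."""
    return Counter(seq_id.split("||", 1)[0] for seq_id in ids)


# Truncated example rows copied from the listing above.
example = [
    "sample1||F4HTPAO07H4B1Q\tsample1||F4HTPAO07ILHKH\tsample1||F4HTPAO07H8VJE\n",
    "sample3||F4HTPAO05FO0LC\tsample2||F4HTPAO02BVI74\tsample3||F4HTPAO05FQCOF\n",
]

clusters = parse_clusters(example)
for representative, members in clusters.items():
    print(representative, dict(sample_counts(members)))
```

With a real file, the list of example rows would be replaced by an open file handle, for instance open("clusters.txt").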
otu_table.txt

a tab-delimited file containing the number of times an OTU is found in each sample. The first column contains the representative sequence id:

OTU                     sample1 sample2 sample3
sample1||F4HTPAO07H4B1Q 12      5       4
sample3||F4HTPAO05FO0LC 2       6       6
...
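
The table can be loaded with nothing more than the Python standard library. The sketch below is one possible approach, not a micca API: read_otu_table is a hypothetical helper, and it assumes that the header row starts with an OTU column and that all counts are integers. The example rows are the two shown above, written with real tab separators.

```python
import csv
import io


def read_otu_table(handle):
    """Return (sample names, {OTU id: {sample name: count}})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first column holds the representative id
    table = {}
    for row in reader:
        if not row:
            continue
        # Assumption: every count is an integer.
        table[row[0]] = dict(zip(samples, (int(value) for value in row[1:])))
    return samples, table


# Example rows copied from the listing above.
example = io.StringIO(
    "OTU\tsample1\tsample2\tsample3\n"
    "sample1||F4HTPAO07H4B1Q\t12\t5\t4\n"
    "sample3||F4HTPAO05FO0LC\t2\t6\t6\n"
)

samples, table = read_otu_table(example)

# Total number of reads per sample across the OTUs that were read.
totals = {s: sum(counts[s] for counts in table.values()) for s in samples}
print(totals)
```

Per-sample totals like these are a common starting point for normalising counts before comparing samples.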
representatives.fasta

a FASTA file containing the representative sequence for each OTU:

>sample1||F4HTPAO07H4B1Q
GTCCACGCCGTAAACGGTGGATGCTGGATGTGGGGCCCGTTCCACGGGTTCCGTGTCGGA
GCTAACGCGTTAAGCATCCCGCCTGGGGAGTACGGCCGCAAGGCTAAAACTCAAAGAAAT
TGACGGGGCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCT
TACCTGGGCTTGACATGTTCCCGACGGTCGTAGAGATACGGCTTCCCTTCGGGGCGGGTT
CACAGGTGGTGCATGGTC
>sample3||F4HTPAO05FO0LC
GTCCACGCCGTAAACGATGAATACTAGGTGTTGGGAAGCATTGCTTCTCGGTGCCGTCGC
AAACGCAGTAAGTATTCCACCTGGGGAGTACGTTCGCAAGAATGAAACTCAAAGGAATTG
ACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTT
ACCAAGTCTTGACATCCTTCTGACCGGTACTTAACCGTACCTTCTCTTCGGAGCAGGAGT
GACAGGTGGTGCATGGTT
...
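
Because sequences are wrapped over several lines, a reader has to join the lines that follow each header. The following Python sketch shows one way to do this. The read_fasta function is a hypothetical helper written for illustration; an established parser such as Biopython's SeqIO may be preferable in practice. The record used is the first one from the listing above.

```python
def read_fasta(lines):
    """Return {sequence id: sequence}, joining wrapped sequence lines."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# First record copied from the listing above.
example = [
    ">sample1||F4HTPAO07H4B1Q",
    "GTCCACGCCGTAAACGGTGGATGCTGGATGTGGGGCCCGTTCCACGGGTTCCGTGTCGGA",
    "GCTAACGCGTTAAGCATCCCGCCTGGGGAGTACGGCCGCAAGGCTAAAACTCAAAGAAAT",
    "TGACGGGGCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCT",
    "TACCTGGGCTTGACATGTTCCCGACGGTCGTAGAGATACGGCTTCCCTTCGGGGCGGGTT",
    "CACAGGTGGTGCATGGTC",
]

representatives = read_fasta(example)
for otu_id, sequence in representatives.items():
    print(otu_id, len(sequence))
```

The dictionary keys are the same representative ids that appear in the first column of otu_table.txt, so the two files can be joined on them.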
taxonomy.txt

a two-column, tab-delimited file containing the taxonomy assigned to each OTU:

sample1||F4HTPAO07H4B1Q Bacteria;Actinobacteria;Actinobact...
sample3||F4HTPAO05FO0LC Bacteria;Firmicutes;Clostridia;Clost...
...
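
A lineage string can be split into its ranks on the semicolon. The sketch below illustrates this under stated assumptions: read_taxonomy is a hypothetical helper, and because the lineages above are truncated, the example row uses a made-up id (otu_a) with the lineage Bacteria;Firmicutes;Clostridia taken from the --blast-ref-taxonomy help text.

```python
def read_taxonomy(lines):
    """Return {OTU id: list of ranks} from two-column, tab-delimited rows."""
    taxonomy = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split once on the first tab: id on the left, lineage on the right.
        otu_id, _, lineage = line.partition("\t")
        # Drop empty ranks, e.g. those caused by a trailing semicolon.
        taxonomy[otu_id] = [rank for rank in lineage.split(";") if rank]
    return taxonomy


# Hypothetical row: "otu_a" is a made-up id used only for illustration.
example = ["otu_a\tBacteria;Firmicutes;Clostridia\n"]

taxonomy = read_taxonomy(example)
for otu_id, ranks in taxonomy.items():
    print(otu_id, ranks)
```

Keeping the ranks as a list makes it straightforward to aggregate OTU counts at a chosen depth, for example by grouping on the first two ranks.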
otu.log

the log file.

$ micca-otu-denovo --help
usage: micca-otu-denovo [-h] [-v] [-M] [-f {fastq,fasta}] [-s SIMILARITY]
                        [-m SIZE] [-c] [-d] [-l DEREP_FAST_LEN]
                        [-t {rdp,blast}] [-o DIR] [--rdp-max-memory MB]
                        [--rdp-min-confidence CONFIDENCE] [--rdp-gene GENE]
                        [--blast-ref FILE] [--blast-ref-taxonomy FILE]
                        [--blast-num-threads NUM] [--blast-e-value VALUE]
                        [--blast-perc-identity PERC]
                        input [input ...]

micca-otu-denovo performs the OTU clustering the taxonomy assigment. Reads are
clustered without any external reference. The output directory will contain
the following:
 
clusters.txt - a tab-delimited file where each row contains the sequence 
               identifiers assigned to the cluster
otu_table.txt - a tab-delimited file containing the number of times an OTU 
                is found in each sample
representatives.fasta - a FASTA file containing the representative sequence
                        for each OTU
taxonomy.txt - a two-columns, tab-delimited file containing the taxonomy
               assigned to each OTU
otu.log - the log file

positional arguments:
  input                 input fastq/a file(s)

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -M, --merged          the input file is a single fasta/q file where samples
                        were merged and the sequences ids are in the form
                        'SAMPLENAME||SEQID'
  -f {fastq,fasta}, --format {fastq,fasta}
                        input file format (default fastq)
  -s SIMILARITY, --similarity SIMILARITY
                        similarity between cluster center and cluster
                        sequences (default 0.97)
  -m SIZE, --minsize SIZE
                        minimum size for a cluster (e.g. 2 removes singletons)
                        (default 2)
  -c, --remove-chimeras
                        remove chimeric sequences (recommended)
  -d, --derep-fast      fastest (prefix based) but less accurate dereplication
                        (recommended for dataset with 200000+ seqs)
  -l DEREP_FAST_LEN, --derep-fast-len DEREP_FAST_LEN
                        prefix length used in fast dereplication (default 200)
  -t {rdp,blast}, --taxonomy {rdp,blast}
                        protocol for taxonomy assignment(default rdp)
  -o DIR, --output-dir DIR
                        output directory (default .)

Taxonomy assignment with RDP Classifier/Database:
  --rdp-max-memory MB   maximum memory size for the java virtual machine in MB
                        (default 1000)
  --rdp-min-confidence CONFIDENCE
                        minimum confidence value to assign taxonomy to a
                        sequence (default 0.8)
  --rdp-gene GENE       16srrna, fungallsu, fungalits_warcup (RDP classifier
                        v.2.8 only), fungalits_unite (RDP classifier v.2.8
                        only) (default 16srrna)

Taxonomy assignment with BLAST:
  --blast-ref FILE      reference sequences in fasta format
  --blast-ref-taxonomy FILE
                        tab-separated id-to-taxonomy file. First column
                        contains the sequence identifiers, and the second
                        column contains the taxonomy separated by semi-colons
                        (e.g., Bacteria;Firmicutes;Clostridia)
  --blast-num-threads NUM
                        number of threads to use in blast search (default 1)
  --blast-e-value VALUE
                        maximum e-value to record an assignment (default
                        1e-29)
  --blast-perc-identity PERC
                        percent identity cutoff (default 90)

Example:

 $ micca-otu-denovo sample1.fastq sample2.fastq sample3.fastq -s 0.95 \
   --rdp-max-memory=2000

micca v. 0.1.
Author: Davide Albanese <davide.albanese@fmach.it>
Fondazione Edmund Mach, 2014.
