classifyΒΆ

$ micca classify --help
usage: micca classify [-h] -i FILE -o FILE [-m {cons,rdp,otuid}] [-r FILE]
                      [-x FILE] [--cons-id CONS_ID]
                      [--cons-maxhits CONS_MAXHITS]
                      [--cons-minfrac CONS_MINFRAC]
                      [--cons-mincov CONS_MINCOV] [--cons-strand {both,plus}]
                      [--cons-threads THREADS]
                      [--rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite}]
                      [--rdp-maxmem GB] [--rdp-minconf RDP_MINCONF]

micca classify assigns taxonomy for each sequence in the input file
and provides three methods for classification:

* VSEARCH-based consensus classifier (cons): input sequences are
  searched in the reference database with VSEARCH
  (https://github.com/torognes/vsearch). For each query sequence the
  method retrives up to 'cons-maxhits' hits (i.e. identity >=
  'cons-id'). Then, the most specific taxonomic label that is
  associated with at least 'cons-minfrac' of the hits is
  assigned. The method is similar to the UCLUST-based consensus
  taxonomy assigner presented in doi: 10.7287/peerj.preprints.934v2
  and available in QIIME.

* RDP classifier (rdp): only RDP classifier version >= 2.8 is
  supported (doi:10.1128/AEM.00062-07). In order to use this
  classifier RDP must be installed (download at
  http://sourceforge.net/projects/rdp-classifier/files/rdp-classifier/)
  and the RDPPATH environment variable setted. The available
  databases (--rdp-gene) are:

  - 16S (16srrna)
  - Fungal LSU (28S) (fungallsu)
  - Warcup ITS (fungalits_warcup, doi: 10.3852/14-293)
  - UNITE ITS (fungalits_unite)

  For more information about the RDP classifier go to
  http://rdp.cme.msu.edu/classifier/classifier.jsp

* OTU ID classifier (otuid): simply perform a sequence ID matching
  with the reference taxonomy file. Recommended strategy when the
  closed reference clustering (--method closedref in micca-otu) was
  performed. OTU ID classifier requires a tab-delimited file where
  the first column contains the current OTU ids and the second column
  the reference taxonomy ids (see otuids.txt in micca-otu), e.g.:

  REF1[TAB]1110191
  REF2[TAB]1104777
  REF3[TAB]1078527
  ...

The input reference taxonomy file (--ref-tax) should be a
tab-delimited file where rows are either in the form:

1. SEQID[TAB]k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;
2. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales;;;
3. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales

Compatible reference database are Greengenes
(http://greengenes.secondgenome.com/downloads), QIIME-formatted SILVA
(https://www.arb-silva.de/download/archive/qiime/) and UNITE
(https://unite.ut.ee/repository.php).

The output file is a tab-delimited file where each row is in the
format:

SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales

optional arguments:
  -h, --help            show this help message and exit

arguments:
  -i FILE, --input FILE
                        input FASTA file (for 'cons' and 'rdp') or a tab-
                        delimited OTU ids file (for 'otuid') (required).
  -o FILE, --output FILE
                        output taxonomy file (required).
  -m {cons,rdp,otuid}, --method {cons,rdp,otuid}
                        classification method (default cons)
  -r FILE, --ref FILE   reference sequences in FASTA format, required for
                        'cons' classifier.
  -x FILE, --ref-tax FILE
                        tab-separated reference taxonomy file, required for
                        'cons' and 'otuid' classifiers.

VSEARCH-based consensus classifierspecific options:
  --cons-id CONS_ID     sequence identity threshold (0.0 to 1.0, default 0.9).
  --cons-maxhits CONS_MAXHITS
                        number of hits to consider (>=1, default 3).
  --cons-minfrac CONS_MINFRAC
                        for each taxonomic rank, a specific taxa will be
                        assigned if it is present in at least MINFRAC of the
                        hits (0.0 to 1.0, default 0.5).
  --cons-mincov CONS_MINCOV
                        reject sequence if the fraction of alignment to the
                        reference sequence is lower than MINCOV. This
                        parameter prevents low-coverage alignments at the end
                        of the sequences (default 0.75).
  --cons-strand {both,plus}
                        search both strands or the plus strand only (default
                        both).
  --cons-threads THREADS
                        number of threads to use (1 to 256, default 1).

RDP Classifier/Database specific options:
  --rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite}
                        marker gene/region
  --rdp-maxmem GB       maximum memory size for the java virtual machine in GB
                        (default 2)
  --rdp-minconf RDP_MINCONF
                        minimum confidence value to assign taxonomy to a
                        sequence (default 0.8)

Examples

Classification of 16S sequences using the consensus classifier and
Greengenes:

    micca classify -m cons -i input.fasta -o tax.txt \
    --ref greengenes_2013_05/rep_set/97_otus.fasta \
    --ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt

Classification of ITS sequences using the RDP classifier and the
UNITE database:

    micca classify -m rdp --rdp-gene fungalits_unite -i input.fasta \
    -o tax.txt

OTU ID matching after the closed reference OTU picking protocol:

    micca classify -m otuid -i otuids.txt -o tax.txt \
    --ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt