filter

$ micca filter --help
usage: micca filter [-h] -i FILE -o FILE [-e MAXEERATE] [-m MINLEN] [-t]
                    [-n MAXNS] [-f {fastq,fasta}]

micca filter filters sequences according to the maximum allowed
expected error (EE) rate %. Optionally, you can:

* discard sequences that are shorter than the specified length
  (suggested for Illumina overlapping paired-end (already merged)
  reads) (option --minlen MINLEN);

* discard sequences that are shorter than the specified length AND
  truncate sequences that are longer (suggested for Illumina and 454
  unpaired reads) (options --minlen MINLEN --trunc);

* discard sequences that contain more than a specified number of Ns
  (--maxns).

Sequences are first shortened and then filtered. Overlapping paired
reads should be merged first (using micca-mergepairs) and then
filtered.
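
The "shorten first, then filter" order can be illustrated with a small Python sketch. This is not micca's implementation (which is based on VSEARCH); the function name `filter_read` and its parameters are hypothetical and only mirror the options described above, assuming phred+33 qualities.

```python
from typing import Optional, Tuple


def filter_read(seq: str, qual: str, maxeerate: float = 1.0, minlen: int = 1,
                trunc: bool = False, maxns: Optional[int] = None
                ) -> Optional[Tuple[str, str]]:
    """Return the (possibly truncated) read, or None if it is discarded.

    Hypothetical helper: a sketch of the documented order of operations,
    not the code micca actually runs.
    """
    # Step 1: shorten. Reads shorter than minlen are always dropped.
    if not seq or len(seq) < minlen:
        return None
    if trunc:
        seq, qual = seq[:minlen], qual[:minlen]
    # Step 2: filter. Both checks see the shortened read.
    if maxns is not None and seq.upper().count("N") > maxns:
        return None
    # phred+33: a score Q corresponds to an error probability of 10^(-Q/10).
    expected_errors = sum(10 ** (-(ord(c) - 33) / 10) for c in qual)
    if expected_errors / len(seq) * 100 > maxeerate:
        return None
    return seq, qual


# A read with a good head (Q40, 'I') and a poor tail (Q2, '#').
read = ("ACGTACGT", "IIII####")
print(filter_read(*read))                        # tail pushes EE rate above 1%
print(filter_read(*read, minlen=4, trunc=True))  # tail removed before filtering
```

Because the expected error rate is computed after truncation, a read whose quality drops toward its 3' end can be rescued by `--trunc` even though it would fail at full length.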

The expected error (EE) rate % in a sequence of length L is defined
as (doi: 10.1093/bioinformatics/btv401):

                sum(error probabilities)
    EE rate % = ------------------------ * 100
                           L
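
As a worked illustration of the formula above, the following Python sketch computes the EE rate % of a single read from its FASTQ quality string. It assumes the phred+33 encoding required for the input file, where a score Q corresponds to an error probability of 10^(-Q/10); the function name `ee_rate_percent` is hypothetical and not part of micca.

```python
def ee_rate_percent(quality: str, offset: int = 33) -> float:
    """Expected error rate (%) of one read from its FASTQ quality string."""
    if not quality:
        raise ValueError("empty quality string")
    # Decode each character to a Phred score, then to an error probability.
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    # Sum of error probabilities divided by the read length L, times 100.
    return sum(error_probs) / len(quality) * 100


# Ten bases at Q20 ('5' in phred+33): each base has p = 0.01,
# so the EE rate is approximately 1%.
print(ee_rate_percent("5" * 10))
```

With the default `--maxeerate 1`, such a read sits right at the threshold; a read of Q40 bases (character `I`) would have an EE rate of about 0.01%.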

Before filtering, run 'micca filterstats' to see how many reads will
pass the filter at different minimum lengths with or without
truncation, given a maximum allowed expected error rate % and maximum
allowed number of Ns.
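
To make the idea behind such a survey concrete, here is a rough Python sketch that counts, for several candidate minimum lengths, how many reads would pass with and without truncation. It does not reproduce the output or internals of 'micca filterstats'; the names `ee_rate_percent` and `count_passing` are hypothetical, qualities are assumed to be phred+33, and the check on the number of Ns is left out for brevity.

```python
def ee_rate_percent(qual: str) -> float:
    """Expected error rate (%) from a phred+33 quality string."""
    return sum(10 ** (-(ord(c) - 33) / 10) for c in qual) / len(qual) * 100


def count_passing(quals, minlens, maxeerate=1.0):
    """Map each minimum length (>= 1) to (passing untruncated, passing truncated)."""
    table = {}
    for minlen in minlens:
        # Reads shorter than minlen are discarded in both modes.
        long_enough = [q for q in quals if len(q) >= minlen]
        no_trunc = sum(ee_rate_percent(q) <= maxeerate for q in long_enough)
        with_trunc = sum(ee_rate_percent(q[:minlen]) <= maxeerate
                         for q in long_enough)
        table[minlen] = (no_trunc, with_trunc)
    return table


# Three toy reads: uniformly good, good head with poor tail, and very short.
quals = ["IIIIIIII", "IIII####", "II"]
for minlen, counts in count_passing(quals, [2, 4, 8]).items():
    print(minlen, counts)
```

In this toy data, truncating at 4 bases keeps the read with the poor tail, while a minimum length of 8 loses it in both modes. That trade-off between read length and the number of surviving reads is what one would look for when choosing MINLEN.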

micca-filter is based on VSEARCH (https://github.com/torognes/vsearch).

optional arguments:
  -h, --help            show this help message and exit

arguments:
  -i FILE, --input FILE
                        input FASTQ file, Sanger/Illumina 1.8+ format
                        (phred+33) (required).
  -o FILE, --output FILE
                        output FASTA/FASTQ file (required).
  -e MAXEERATE, --maxeerate MAXEERATE
                        discard sequences with more than the specified expected
                        error rate % (values <=1%, i.e. less than or equal to
                        one error per 100 bases, are highly recommended).
                        Sequences are discarded after truncation (if enabled)
                        (default 1).
  -m MINLEN, --minlen MINLEN
                        discard sequences that are shorter than MINLEN
                        (default 1).
  -t, --trunc           truncate sequences that are longer than MINLEN
                        (disabled by default).
  -n MAXNS, --maxns MAXNS
                        discard sequences with more than the specified number
                        of Ns. Sequences are discarded after truncation
                        (disabled by default).
  -f {fastq,fasta}, --output-format {fastq,fasta}
                        file format (default fasta).

Examples

Truncate reads at 300 bp, discard low-quality sequences
(with EE rate > 0.5%) and write a FASTA file:

    micca filter -i reads.fastq -o filtered.fasta -m 300 -t -e 0.5