fastfilter

High-performance FASTQ quality filter for paired-end and single-end RNA-seq data.

Overview

fastfilter is a fast, memory-efficient command-line tool for filtering FASTQ files generated by high-throughput RNA sequencing. It supports paired-end and single-end modes, processes multiple samples in parallel, and produces per-sample quality reports.

Developed at the RNA Systems Biology Lab, BioISI — Biosystems and Integrative Sciences Institute, Faculty of Sciences, University of Lisbon.

Features

- Paired-end and single-end filtering in a single tool
- Five independent filters per read:
  - Minimum sequence length
  - Minimum mean Phred quality score
  - Homopolymer detection (A/T/G/C runs)
  - Configurable N base content threshold
  - Dot (.) character rejection
- Strict pair synchronisation: output R1 and R2 always have identical read counts
- Runtime mismatch detection: aborts with a clear error if input files have different read counts
- Parallel processing: multiple samples processed simultaneously
- gzip support: reads and writes .fastq.gz; uses isal for 2–4× faster decompression when available
- Per-sample summary CSV: vertical format, one metric per row, includes all filter parameters and length statistics
- Screen / nohup safe: timestamped checkpoint prints when running detached; tqdm bar when interactive
- Sequence normalisation: all bases uppercased at read time

Requirements

Python 3.8 or newer
tqdm
isal (optional — faster gzip; stdlib fallback is automatic)

pip install -r requirements.txt

Installation

git clone https://github.com/GamaPintoLab/fastfilter_v2
cd fastfilter_v2
pip install -r requirements.txt
chmod +x fastfilter.py

No build step required. The script runs directly.

Quick Start

Paired-end — multiple samples in parallel:

./fastfilter.py \
  -r1 sample1_R1.fastq sample2_R1.fastq \
  -r2 sample1_R2.fastq sample2_R2.fastq \
  -o results/ \
  -j 4

Paired-end — gzip input:

./fastfilter.py \
  -r1 *_R1.fastq.gz \
  -r2 *_R2.fastq.gz \
  -o results/

Single-end:

./fastfilter.py \
  -r sample.fastq \
  -o results/

Custom thresholds:

./fastfilter.py \
  -r1 sample_R1.fastq \
  -r2 sample_R2.fastq \
  -l 50 -s 35 -p 20 -n 2 \
  -o results/

Arguments

Flag	Long form	Default	Description
`-r1`	`--r1-files`	—	R1 (forward) FASTQ file(s) — paired-end mode
`-r2`	`--r2-files`	—	R2 (reverse) FASTQ file(s) — paired-end mode
`-r`	`--reads`	—	FASTQ file(s) — single-end mode
`-o`	`--output-dir`	`<input_dir>/fastfilter/`	Output directory (created if absent)
`-l`	`--minlen`	`25`	Minimum sequence length (bp)
`-s`	`--min-score`	`30`	Minimum mean Phred quality score
`-p`	`--homopolymerlen`	`25`	Homopolymer run length threshold
`-n`	`--max-n`	`0`	Maximum N bases allowed per read
`-j`	`--cpus`	`1`	Number of parallel worker processes
`-Z`	—	off	Use compression level 1 (fast) instead of default level 6

-r1, -r2, and -r each accept multiple files. The i-th R1 file is paired with the i-th R2 file, so both lists must be given in the same sample order.
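
The positional pairing rule can be sketched in a few lines of Python. This is only an illustration, not fastfilter's own code: `pair_inputs` is a hypothetical helper name, and raising an error on unequal list lengths is an assumption about sensible behaviour.

```python
from typing import List, Tuple


def pair_inputs(r1_files: List[str], r2_files: List[str]) -> List[Tuple[str, str]]:
    """Pair the i-th R1 file with the i-th R2 file (hypothetical helper)."""
    if len(r1_files) != len(r2_files):
        # Assumption: unequal list lengths are treated as a usage error.
        raise ValueError(
            f"got {len(r1_files)} R1 file(s) but {len(r2_files)} R2 file(s)"
        )
    return list(zip(r1_files, r2_files))


pairs = pair_inputs(
    ["sample1_R1.fastq", "sample2_R1.fastq"],
    ["sample1_R2.fastq", "sample2_R2.fastq"],
)
for r1, r2 in pairs:
    print(r1, "+", r2)
```

Because pairing depends only on position, shell globs such as `*_R1.fastq.gz` and `*_R2.fastq.gz` pair correctly only when both expand in the same sample order. Shells sort glob expansions, so consistent file naming is normally enough.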

Output Files

For each input sample, the following files are written to the output directory:

File	Description
`<stem>.filtered.fastq[.gz]`	Reads that passed all filters
`<stem>.summary.csv`	Per-sample quality report

Output format (plain or gzip) matches the input automatically.
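
One way such format matching could work is to choose the opener from the file suffix and prefer `isal` when it is installed. The sketch below is written under those assumptions and is not taken from fastfilter itself; `open_fastq` is a hypothetical helper name.

```python
import gzip
import os
import tempfile

try:
    # python-isal ships a gzip-compatible module; it is an optional dependency.
    from isal import igzip as gz_backend
except ImportError:
    gz_backend = gzip  # standard-library fallback


def open_fastq(path, mode="rt"):
    """Open a FASTQ file as text, handling .gz transparently (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gz_backend.open(path, mode)
    return open(path, mode)


# Round-trip one record through both the plain and the gzip code path.
record = "@read1\nACGT\n+\nIIII\n"
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("demo.fastq", "demo.fastq.gz"):
        path = os.path.join(tmp, name)
        with open_fastq(path, "wt") as fh:
            fh.write(record)
        with open_fastq(path, "rt") as fh:
            results[name] = fh.read()
```

Deciding by suffix keeps the reader and writer symmetric: the same helper serves input and output, so a `.gz` input naturally leads to a `.gz` output.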

Summary Report

Each .summary.csv uses a vertical metric,value format for readability in any text editor or spreadsheet.
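
The vertical layout is easy to load programmatically. The following is a minimal sketch that assumes each row holds exactly two comma-separated fields (metric, then value); whether the real file starts with a header row is not stated here, so the parser simply treats every two-field row as data. `read_summary` is a hypothetical helper, and the example content is made up for illustration.

```python
import csv
import io


def read_summary(handle):
    """Load a vertical metric,value report into a dict of strings."""
    summary = {}
    for row in csv.reader(handle):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        metric, value = row
        summary[metric.strip()] = value.strip()
    return summary


# Illustrative content only; a real report contains many more rows.
example = io.StringIO("sample,sample1\ntotal_reads,250000\npassed_reads,245169\n")
report = read_summary(example)
pct = 100 * int(report["passed_reads"]) / int(report["total_reads"])
print(f"{report['sample']}: {pct:.2f}% of pairs passed")
```

For a file on disk, pass `open("results/sample1.summary.csv", newline="")` instead of the in-memory example; that path is only a placeholder.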

Paired-end fields (R1 report):

Metric	Description
`sample`	Sample name
`r1_file`	R1 input filename
`total_reads`	Total read pairs processed
`passed_reads`	Pairs where both mates passed all filters
`failed_reads`	Pairs that did not pass
`pct_pairs_passed`	Percentage of pairs passed
`r1_pass_rate`	Percentage of R1 reads passing individually
`lost_due_to_r1_fail`	Pairs lost because R1 failed (R2 was fine)
`failed_both`	Pairs where both mates failed
`r1_too_short`	R1 reads below minimum length
`r1_n`	R1 reads exceeding N threshold
`r1_dot`	R1 reads containing `.` characters
`r1_homopolymer`	R1 reads with a homopolymer run
`r1_low_score`	R1 reads below minimum quality score
`r1_len_min/max/mean/median`	Read length statistics
`min_length`	Length threshold used
`homopolymer_len`	Homopolymer threshold used
`min_score`	Quality threshold used
`max_n_allowed`	N threshold used
`elapsed_min`	Total wall-clock time (minutes)

Exclusion reason counts are not mutually exclusive — a read failing multiple filters is counted in each applicable category.

Filtering Logic

All five filters are evaluated independently per read. A pair is written to output only when both mates pass all filters.

For each read:
  1. len(seq) >= min_length
  2. mean_phred(qual) >= min_score        where mean_phred = mean(ord(c) - 33 for c in qual)
  3. seq.count('N') <= max_n
  4. '.' not in seq
  5. no homopolymer run of length >= homopolymer_len  (A, T, G, or C)
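
The five checks above can be sketched in Python as follows. This is an illustrative reading of the rules, not the tool's actual implementation: the function names, the regular-expression approach to homopolymer detection, and the reason labels (chosen to mirror the summary-report suffixes) are all assumptions. Phred+33 encoding is assumed, as in the formula above.

```python
import re
from typing import List


def mean_phred(qual: str) -> float:
    """Mean Phred score, assuming Phred+33 encoding."""
    return sum(ord(c) - 33 for c in qual) / len(qual) if qual else 0.0


def failed_filters(seq: str, qual: str, min_length: int = 25, min_score: float = 30,
                   max_n: int = 0, homopolymer_len: int = 25) -> List[str]:
    """Return the name of every filter the read fails (empty list means pass)."""
    seq = seq.upper()  # bases are normalised to uppercase
    # One base followed by at least homopolymer_len - 1 repeats of itself.
    homopolymer = re.compile(r"([ATGC])\1{%d,}" % (homopolymer_len - 1))
    reasons = []
    if len(seq) < min_length:
        reasons.append("too_short")
    if mean_phred(qual) < min_score:
        reasons.append("low_score")
    if seq.count("N") > max_n:
        reasons.append("n")
    if "." in seq:
        reasons.append("dot")
    if homopolymer.search(seq):
        reasons.append("homopolymer")
    return reasons


def pair_passes(r1, r2, **thresholds) -> bool:
    """A pair is kept only when both mates pass every filter."""
    return not failed_filters(*r1, **thresholds) and not failed_filters(*r2, **thresholds)


good = ("ACGT" * 10, "I" * 40)   # 40 bp, every base at Phred 40
short_n = ("ACGTN", "I" * 5)     # too short and contains an N
poly_a = ("A" * 30, "I" * 30)    # 30 bp run of A
```

Every check runs even after an earlier one fails, which is why a single read can be counted under several exclusion reasons: `failed_filters(*short_n)` reports both `too_short` and `n`.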

Performance Notes

Factor	Detail
gzip backend	Install `isal` for 2–4× faster I/O on `.fastq.gz` files. Active backend shown at startup.
Parallelism	`-j N` processes N samples simultaneously. Workers are capped to the number of samples.
Memory	Length statistics use `Counter`, so memory grows with the number of distinct read lengths rather than with file size. No reads are held in RAM.
Progress	tqdm bar in interactive single-worker sessions; timestamped checkpoint lines every 5M reads otherwise.
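
To illustrate the memory point, length statistics can be derived from a length histogram without storing any reads. The sketch below is one possible approach under stated assumptions, not fastfilter's code: `length_stats` is a hypothetical helper, and the median is taken as the average of the two middle values for even read counts, which may differ from the tool's definition.

```python
from collections import Counter


def length_stats(length_counts: Counter) -> dict:
    """Compute min, max, mean, and median from a {length: count} histogram."""
    total = sum(length_counts.values())
    if total == 0:
        return {"min": None, "max": None, "mean": None, "median": None}
    lengths = sorted(length_counts)
    mean = sum(length * n for length, n in length_counts.items()) / total

    def value_at(rank: int) -> int:
        """Length of the read at 0-based position `rank` in sorted order."""
        seen = 0
        for length in lengths:
            seen += length_counts[length]
            if rank < seen:
                return length
        raise IndexError(rank)

    # Median: the middle element, or the average of the two middle elements.
    median = (value_at((total - 1) // 2) + value_at(total // 2)) / 2
    return {"min": lengths[0], "max": lengths[-1], "mean": mean, "median": median}


counts = Counter()
for read_length in (50, 75, 75, 100):  # stand-in for lengths seen while streaming
    counts[read_length] += 1
stats = length_stats(counts)
```

The histogram holds one entry per distinct read length, so its size is bounded by the longest read rather than by the number of reads in the file.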

Example Terminal Output

fastfilter — 2026-04-15 11:40:32
  Output dir  : results/
  Min length  : 25
  Min score   : 30
  Homopolymer : 25
  Max N       : 0
  Compression : level 6  [isal (fast)]
  Mode        : paired-end | 2 sample(s) | 2 CPU(s)

  [1] sample1_R1.fastq  +  sample1_R2.fastq
  [2] sample2_R1.fastq  +  sample2_R2.fastq

[11:40:39] sample1_R1.fastq: Finished. 245169 / 250000 passed.
[11:40:41] sample2_R1.fastq: Finished. 241083 / 250000 passed.

[11:40:41] All done. Ran in 0.15 min.

Repository Structure

fastfilter/
├── fastfilter.py        # Main script
├── requirements.txt     # Python dependencies
├── CHANGELOG.md         # Version history
├── LICENSE              # MIT License
└── README.md            # This file

Citation

If you use fastfilter in your research, please acknowledge:

Monteiro, L. (2026). fastfilter: High-performance FASTQ quality filter for RNA-seq data (v2.0). RNA Systems Biology Lab, BioISI, Faculty of Sciences, University of Lisbon.

License

MIT License — see LICENSE for details.

Author

Lucas Monteiro
RNA Systems Biology Lab
BioISI — Biosystems and Integrative Sciences Institute
Department of Chemistry and Biochemistry
Faculty of Sciences, University of Lisbon
✉ ldmonteiro@fc.ul.pt
