GamaPintoLab/fastfilter_v2

fastfilter

High-performance FASTQ quality filter for paired-end and single-end RNA-seq data.

Python 3.8+ | License: MIT


Overview

fastfilter is a fast, memory-efficient command-line tool for filtering FASTQ files generated by high-throughput RNA sequencing. It supports paired-end and single-end modes, processes multiple samples in parallel, and produces per-sample quality reports.

Developed at the RNA Systems Biology Lab, BioISI — Biosystems and Integrative Sciences Institute, Faculty of Sciences, University of Lisbon.


Features

  • Paired-end and single-end filtering in a single tool
  • Five independent filters per read:
    • Minimum sequence length
    • Minimum mean Phred quality score
    • Homopolymer detection (A/T/G/C runs)
    • Configurable N base content threshold
    • Dot (.) character rejection
  • Strict pair synchronisation — output R1 and R2 always have identical read counts
  • Runtime mismatch detection — aborts with a clear error if input files have different read counts
  • Parallel processing — multiple samples processed simultaneously
  • gzip support — reads and writes .fastq.gz; uses isal for 2–4× faster decompression when available
  • Per-sample summary CSV — vertical format, one metric per row, includes all filter parameters and length statistics
  • Screen / nohup safe — timestamped checkpoint prints when running detached; tqdm bar when interactive
  • Sequence normalisation — all bases uppercased at read time
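
The pair synchronisation, mismatch detection, and uppercasing listed above could be sketched roughly as follows. This is a minimal illustration and not the code in fastfilter.py: `read_fastq` and `paired_records` are hypothetical names, and the sketch assumes strict four-line FASTQ records.

```python
import io
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, seq, qual) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) ends the stream
            return
        seq = handle.readline().rstrip("\n").upper()  # normalise bases at read time
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def paired_records(r1_handle, r2_handle):
    """Yield (r1_record, r2_record); raise if one file runs out before the other."""
    for rec1, rec2 in zip_longest(read_fastq(r1_handle), read_fastq(r2_handle)):
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of reads")
        yield rec1, rec2


# Tiny in-memory inputs, for illustration only.
R1_TEXT = "@a/1\nacgt\n+\nIIII\n@b/1\nGGCC\n+\nIIII\n"
R2_TEXT = "@a/2\nTTTT\n+\nIIII\n@b/2\nAAAA\n+\nIIII\n"

pairs = list(paired_records(io.StringIO(R1_TEXT), io.StringIO(R2_TEXT)))
print(len(pairs), pairs[0][0][1])
```

Because both mates are read in lockstep and written (or dropped) together, the output files cannot drift out of sync in a design like this.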

Requirements

  • Python 3.8 or newer
  • tqdm
  • isal (optional; faster gzip, with automatic stdlib fallback)

pip install -r requirements.txt

Installation

git clone https://github.com/GamaPintoLab/fastfilter_v2
cd fastfilter_v2
pip install -r requirements.txt
chmod +x fastfilter.py

No build step required. The script runs directly.


Quick Start

Paired-end — multiple samples in parallel:

./fastfilter.py \
  -r1 sample1_R1.fastq sample2_R1.fastq \
  -r2 sample1_R2.fastq sample2_R2.fastq \
  -o results/ \
  -j 4

Paired-end — gzip input:

./fastfilter.py \
  -r1 *_R1.fastq.gz \
  -r2 *_R2.fastq.gz \
  -o results/

Single-end:

./fastfilter.py \
  -r sample.fastq \
  -o results/

Custom thresholds:

./fastfilter.py \
  -r1 sample_R1.fastq \
  -r2 sample_R2.fastq \
  -l 50 -s 35 -p 20 -n 2 \
  -o results/

Arguments

| Flag | Long form | Default | Description |
|------|-----------|---------|-------------|
| `-r1` | `--r1-files` | | R1 (forward) FASTQ file(s), paired-end mode |
| `-r2` | `--r2-files` | | R2 (reverse) FASTQ file(s), paired-end mode |
| `-r` | `--reads` | | FASTQ file(s), single-end mode |
| `-o` | `--output-dir` | `<input_dir>/fastfilter/` | Output directory (created if absent) |
| `-l` | `--minlen` | 25 | Minimum sequence length (bp) |
| `-s` | `--min-score` | 30 | Minimum mean Phred quality score |
| `-p` | `--homopolymerlen` | 25 | Homopolymer run length threshold |
| `-n` | `--max-n` | 0 | Maximum N bases allowed per read |
| `-j` | `--cpus` | 1 | Number of parallel worker processes |
| `-Z` | | off | Use compression level 1 (fast) instead of the default level 6 |

-r1 / -r2 / -r accept multiple files. The i-th R1 file is paired with the i-th R2 file.
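
Positional pairing of this kind can be illustrated with a short sketch. `pair_inputs` is a hypothetical helper, not a function from fastfilter.py, and the length check is an assumption about how mismatched argument lists might be handled.

```python
from pathlib import Path
from typing import List, Tuple


def pair_inputs(r1_files: List[str], r2_files: List[str]) -> List[Tuple[Path, Path]]:
    """Pair the i-th R1 file with the i-th R2 file (hypothetical helper)."""
    if len(r1_files) != len(r2_files):
        raise ValueError(
            f"got {len(r1_files)} R1 file(s) but {len(r2_files)} R2 file(s)"
        )
    return [(Path(r1), Path(r2)) for r1, r2 in zip(r1_files, r2_files)]


pairs = pair_inputs(
    ["sample1_R1.fastq", "sample2_R1.fastq"],
    ["sample1_R2.fastq", "sample2_R2.fastq"],
)
for r1, r2 in pairs:
    print(f"{r1.name}  +  {r2.name}")
```

When the file lists come from shell globs such as `*_R1.fastq.gz` and `*_R2.fastq.gz`, the shell expands each pattern in sorted order, so the pairing is only correct if R1 and R2 names sort the same way. Checking the pairing printed at startup is a cheap safeguard.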


Output Files

For each input sample, the following files are written to the output directory:

| File | Description |
|------|-------------|
| `<stem>.filtered.fastq[.gz]` | Reads that passed all filters |
| `<stem>.summary.csv` | Per-sample quality report |

Output format (plain or gzip) matches the input automatically.
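
One way to choose the reader and writer from the file extension, with an optional isal backend, is sketched below. `open_fastq` and `BACKEND` are hypothetical names, and the sketch is not taken from fastfilter.py. It uses isal only for reading and writes with the standard library.

```python
import gzip
import os
import tempfile

try:
    # python-isal ships a gzip-compatible module; it is an optional dependency.
    from isal import igzip as fast_gzip
    BACKEND = "isal"
except ImportError:
    fast_gzip = gzip
    BACKEND = "stdlib gzip"


def open_fastq(path, mode="rt", compresslevel=6):
    """Open a plain or gzip-compressed FASTQ file, chosen by extension."""
    path = str(path)
    if not path.endswith(".gz"):
        return open(path, mode)
    if "r" in mode:
        # Decompression goes through isal when it is installed.
        return fast_gzip.open(path, mode)
    # Compression uses the stdlib so the zlib level scale (1-9) applies.
    return gzip.open(path, mode, compresslevel=compresslevel)


# Round trip through a temporary .fastq.gz file.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "demo.fastq.gz")
    with open_fastq(gz_path, "wt", compresslevel=1) as fh:
        fh.write("@read1\nACGT\n+\nIIII\n")
    with open_fastq(gz_path) as fh:
        lines = fh.read().splitlines()

print(BACKEND, lines)
```

Writing stays on the standard library here because ISA-L uses a different compression-level scale than zlib, so levels such as 6 do not carry over directly.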


Summary Report

Each .summary.csv uses a vertical metric,value format for readability in any text editor or spreadsheet.

Paired-end fields (R1 report):

| Metric | Description |
|--------|-------------|
| `sample` | Sample name |
| `r1_file` | R1 input filename |
| `total_reads` | Total read pairs processed |
| `passed_reads` | Pairs where both mates passed all filters |
| `failed_reads` | Pairs that did not pass |
| `pct_pairs_passed` | Percentage of pairs passed |
| `r1_pass_rate` | Percentage of R1 reads passing individually |
| `lost_due_to_r1_fail` | Pairs lost because R1 failed (R2 was fine) |
| `failed_both` | Pairs where both mates failed |
| `r1_too_short` | R1 reads below minimum length |
| `r1_n` | R1 reads exceeding N threshold |
| `r1_dot` | R1 reads containing `.` characters |
| `r1_homopolymer` | R1 reads with a homopolymer run |
| `r1_low_score` | R1 reads below minimum quality score |
| `r1_len_min/max/mean/median` | Read length statistics |
| `min_length` | Length threshold used |
| `homopolymer_len` | Homopolymer threshold used |
| `min_score` | Quality threshold used |
| `max_n_allowed` | N threshold used |
| `elapsed_min` | Total wall-clock time (minutes) |

Exclusion reason counts are not mutually exclusive — a read failing multiple filters is counted in each applicable category.
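
A summary file in this metric,value layout can be loaded into a dictionary with the standard `csv` module. The sketch below is illustrative: `read_summary` is a hypothetical helper, the presence of a `metric,value` header row is an assumption (the code tolerates its absence), and the sample text reuses the figures from the example terminal output in this README.

```python
import csv
import io


def read_summary(handle):
    """Load a two-column metric,value CSV into a dict of strings."""
    summary = {}
    for row in csv.reader(handle):
        if len(row) < 2 or row[0].strip().lower() == "metric":
            continue  # skip blank lines and an optional header row
        summary[row[0].strip()] = row[1].strip()
    return summary


# Illustrative content only, not a real report.
example = io.StringIO(
    "metric,value\n"
    "sample,sample1\n"
    "total_reads,250000\n"
    "passed_reads,245169\n"
)

summary = read_summary(example)
pct = 100 * int(summary["passed_reads"]) / int(summary["total_reads"])
print(f"{summary['sample']}: {pct:.2f}% of pairs passed")
```

For a real report, pass an open file instead of the `StringIO` object, for example `open("results/sample1.summary.csv", newline="")`, where the path is a placeholder.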


Filtering Logic

All five filters are evaluated independently per read. A pair is written to output only when both mates pass all filters.

For each read:
  1. len(seq) >= min_length
  2. mean_phred(qual) >= min_score        where mean_phred = mean(ord(c) - 33 for c in qual)
  3. seq.count('N') <= max_n
  4. '.' not in seq
  5. no homopolymer run of length >= homopolymer_len  (A, T, G, or C)
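
The five checks above translate fairly directly into Python. The following is a minimal sketch under stated assumptions, not the implementation in fastfilter.py: `mean_phred` and `failed_filters` are hypothetical names, quality strings are assumed to be Phred+33 encoded, and the reason labels are chosen to echo the summary metrics.

```python
import re


def mean_phred(qual: str) -> float:
    """Mean Phred score, assuming Phred+33 encoding."""
    return sum(ord(c) - 33 for c in qual) / len(qual) if qual else 0.0


def failed_filters(seq, qual, min_length=25, min_score=30,
                   homopolymer_len=25, max_n=0):
    """Return the names of every filter a read fails; empty list means it passes."""
    seq = seq.upper()
    # One base followed by (homopolymer_len - 1) repeats of itself, or more.
    # Compiled per call for clarity; real code would likely compile it once.
    homopolymer = re.compile(r"([ACGT])\1{%d,}" % (homopolymer_len - 1))

    reasons = []
    if len(seq) < min_length:
        reasons.append("too_short")
    if mean_phred(qual) < min_score:
        reasons.append("low_score")
    if seq.count("N") > max_n:
        reasons.append("n")
    if "." in seq:
        reasons.append("dot")
    if homopolymer.search(seq):
        reasons.append("homopolymer")
    return reasons


print(failed_filters("ACGT" * 10, "I" * 40))  # → []
print(failed_filters("A" * 30, "#" * 30))     # → ['low_score', 'homopolymer']
```

Every check runs regardless of earlier failures, so a single read can be counted under several exclusion reasons. In paired-end mode, a pair would be kept only when this list is empty for both mates.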

Performance Notes

| Factor | Detail |
|--------|--------|
| gzip backend | Install isal for 2–4× faster I/O on .fastq.gz files. Active backend shown at startup. |
| Parallelism | `-j N` processes N samples simultaneously. Workers are capped to the number of samples. |
| Memory | Length statistics use Counter: constant memory regardless of file size. No reads held in RAM. |
| Progress | tqdm bar in interactive single-worker sessions; timestamped checkpoint lines every 5M reads otherwise. |
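
The memory point can be made concrete with a short sketch. A `Counter` keyed by read length grows with the number of distinct lengths rather than the number of reads, and min, max, mean, and median can all be derived from it. `length_stats` is a hypothetical helper, not the function used in fastfilter.py.

```python
from collections import Counter


def length_stats(length_counts: Counter) -> dict:
    """Compute min, max, mean and median from a {length: count} histogram."""
    total = sum(length_counts.values())
    if total == 0:
        return {"min": 0, "max": 0, "mean": 0.0, "median": 0.0}

    lengths = sorted(length_counts)
    mean = sum(length * n for length, n in length_counts.items()) / total

    # Zero-based ranks of the middle value(s); they coincide when total is odd.
    lo_rank, hi_rank = (total - 1) // 2, total // 2
    seen, lo_val, hi_val = 0, None, None
    for length in lengths:
        seen += length_counts[length]
        if lo_val is None and seen > lo_rank:
            lo_val = length
        if seen > hi_rank:
            hi_val = length
            break

    return {
        "min": lengths[0],
        "max": lengths[-1],
        "mean": mean,
        "median": (lo_val + hi_val) / 2,
    }


counts = Counter()
for read_length in (50, 75, 75, 100):  # stand-in for lengths seen while streaming
    counts[read_length] += 1

stats = length_stats(counts)
print(stats)
```

Each read only increments one counter entry as it streams past, so no sequences need to be kept in memory to report the statistics at the end.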

Example Terminal Output

fastfilter — 2026-04-15 11:40:32
  Output dir  : results/
  Min length  : 25
  Min score   : 30
  Homopolymer : 25
  Max N       : 0
  Compression : level 6  [isal (fast)]
  Mode        : paired-end | 2 sample(s) | 2 CPU(s)

  [1] sample1_R1.fastq  +  sample1_R2.fastq
  [2] sample2_R1.fastq  +  sample2_R2.fastq

[11:40:39] sample1_R1.fastq: Finished. 245169 / 250000 passed.
[11:40:41] sample2_R1.fastq: Finished. 241083 / 250000 passed.

[11:40:41] All done. Ran in 0.15 min.

Repository Structure

fastfilter/
├── fastfilter.py        # Main script
├── requirements.txt     # Python dependencies
├── CHANGELOG.md         # Version history
├── LICENSE              # MIT License
└── README.md            # This file

Citation

If you use fastfilter in your research, please acknowledge:

Monteiro, L. (2026). fastfilter: High-performance FASTQ quality filter for RNA-seq data (v2.0). RNA Systems Biology Lab, BioISI, Faculty of Sciences, University of Lisbon.


License

MIT License — see LICENSE for details.


Author

Lucas Monteiro
RNA Systems Biology Lab
BioISI — Biosystems and Integrative Sciences Institute
Department of Chemistry and Biochemistry
Faculty of Sciences, University of Lisbon
ldmonteiro@fc.ul.pt
