High-performance FASTQ quality filter for paired-end and single-end RNA-seq data.
fastfilter is a fast, memory-efficient command-line tool for filtering FASTQ files generated by high-throughput RNA sequencing. It supports paired-end and single-end modes, processes multiple samples in parallel, and produces per-sample quality reports.
Developed at the RNA Systems Biology Lab, BioISI — Biosystems and Integrative Sciences Institute, Faculty of Sciences, University of Lisbon.
- Paired-end and single-end filtering in a single tool
- Five independent filters per read:
  - Minimum sequence length
  - Minimum mean Phred quality score
  - Homopolymer detection (A/T/G/C runs)
  - Configurable N base content threshold
  - Dot (`.`) character rejection
- Strict pair synchronisation — output R1 and R2 always have identical read counts
- Runtime mismatch detection — aborts with a clear error if input files have different read counts
- Parallel processing — multiple samples processed simultaneously
- gzip support: reads and writes `.fastq.gz`; uses isal for 2–4× faster decompression when available
- Per-sample summary CSV: vertical format, one metric per row, includes all filter parameters and length statistics
- Screen / nohup safe — timestamped checkpoint prints when running detached; tqdm bar when interactive
- Sequence normalisation — all bases uppercased at read time
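
The pair synchronisation, mismatch detection, and uppercase normalisation listed above can be sketched roughly as follows. This is an illustration under assumptions, not fastfilter's actual code: `read_fastq`, `paired_records`, and the error message are hypothetical names, and the parser assumes plain four-line FASTQ records.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, seq, qual) from four-line records, uppercasing bases."""
    it = iter(lines)
    for header in it:
        seq = next(it, "").rstrip("\n").upper()  # normalise at read time
        next(it, "")  # '+' separator line, ignored
        qual = next(it, "").rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def paired_records(r1: Iterable[Record],
                   r2: Iterable[Record]) -> Iterator[Tuple[Record, Record]]:
    """Yield mate pairs in lockstep; abort if one input runs out first."""
    for n, (a, b) in enumerate(zip_longest(r1, r2), start=1):
        if a is None or b is None:
            # Hypothetical message; the real tool's wording may differ.
            raise ValueError(
                f"read count mismatch: one input has only {n - 1} reads"
            )
        yield a, b
```

Because both mates always advance together, any pair that is written or dropped is written or dropped from both outputs, which keeps R1 and R2 read counts identical.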

Installation:

    git clone https://github.com/GamaPintoLab/fastfilter_v2
    cd fastfilter_v2
    pip install -r requirements.txt
    chmod +x fastfilter.py

No build step required. The script runs directly.

Paired-end, multiple samples in parallel:

    ./fastfilter.py \
      -r1 sample1_R1.fastq sample2_R1.fastq \
      -r2 sample1_R2.fastq sample2_R2.fastq \
      -o results/ \
      -j 4

Paired-end, gzip input:

    ./fastfilter.py \
      -r1 *_R1.fastq.gz \
      -r2 *_R2.fastq.gz \
      -o results/

Single-end:

    ./fastfilter.py \
      -r sample.fastq \
      -o results/

Custom thresholds:

    ./fastfilter.py \
      -r1 sample_R1.fastq \
      -r2 sample_R2.fastq \
      -l 50 -s 35 -p 20 -n 2 \
      -o results/

| Flag | Long form | Default | Description |
|---|---|---|---|
| `-r1` | `--r1-files` | none | R1 (forward) FASTQ file(s), paired-end mode |
| `-r2` | `--r2-files` | none | R2 (reverse) FASTQ file(s), paired-end mode |
| `-r` | `--reads` | none | FASTQ file(s), single-end mode |
| `-o` | `--output-dir` | `<input_dir>/fastfilter/` | Output directory (created if absent) |
| `-l` | `--minlen` | `25` | Minimum sequence length (bp) |
| `-s` | `--min-score` | `30` | Minimum mean Phred quality score |
| `-p` | `--homopolymerlen` | `25` | Homopolymer run length threshold |
| `-n` | `--max-n` | `0` | Maximum N bases allowed per read |
| `-j` | `--cpus` | `1` | Number of parallel worker processes |
| `-Z` | none | off | Use compression level 1 (fast) instead of default level 6 |

`-r1`, `-r2`, and `-r` accept multiple files. The i-th R1 file is paired with the i-th R2 file.
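
Positional pairing means the order of the two file lists matters. A minimal sketch of that pairing is shown below; `pair_inputs` is a hypothetical helper, and whether fastfilter itself rejects lists of unequal length is an assumption here.

```python
from typing import List, Sequence, Tuple


def pair_inputs(r1_files: Sequence[str],
                r2_files: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair the i-th R1 file with the i-th R2 file."""
    if len(r1_files) != len(r2_files):
        # Assumed behaviour: refuse to guess which files belong together.
        raise ValueError(
            f"{len(r1_files)} R1 file(s) but {len(r2_files)} R2 file(s)"
        )
    return list(zip(r1_files, r2_files))


pairs = pair_inputs(
    ["sample1_R1.fastq", "sample2_R1.fastq"],
    ["sample1_R2.fastq", "sample2_R2.fastq"],
)
for r1, r2 in pairs:
    print(r1, "+", r2)
```

Shell globs such as `*_R1.fastq.gz` and `*_R2.fastq.gz` expand in sorted order, so they typically line up when mate files differ only in the `R1`/`R2` part of the name.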
For each input sample, the following files are written to the output directory:
| File | Description |
|---|---|
| `<stem>.filtered.fastq[.gz]` | Reads that passed all filters |
| `<stem>.summary.csv` | Per-sample quality report |

Output format (plain or gzip) matches the input automatically.
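
One way to get extension-based format matching with an optional isal backend is sketched below. It assumes the `python-isal` package (module `isal.igzip`) is what "isal" refers to, uses it for reading only, and falls back to the standard library otherwise. `open_fastq` and `filtered_name` are hypothetical helpers, and the rule used to derive `<stem>` is a guess.

```python
import gzip
import os

try:
    from isal import igzip as fast_gzip  # optional: faster decompression
except ImportError:
    fast_gzip = gzip  # standard-library fallback


def open_fastq(path: str, mode: str = "rt", level: int = 6):
    """Open a plain or gzip FASTQ file in text mode, chosen by extension."""
    if not path.endswith(".gz"):
        return open(path, mode)
    if mode.startswith("r"):
        return fast_gzip.open(path, mode)
    # Writing goes through stdlib gzip so levels 1 and 6 are both valid.
    return gzip.open(path, mode, compresslevel=level)


def filtered_name(input_path: str) -> str:
    """Build '<stem>.filtered.fastq[.gz]', keeping the input's compression."""
    name = os.path.basename(input_path)
    is_gz = name.endswith(".gz")
    if is_gz:
        name = name[:-3]
    for ext in (".fastq", ".fq"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    return f"{name}.filtered.fastq" + (".gz" if is_gz else "")
```

For example, `filtered_name("data/s1_R1.fastq.gz")` gives `s1_R1.filtered.fastq.gz`, while a plain `s1.fastq` input maps to `s1.filtered.fastq`.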
Each `.summary.csv` uses a vertical `metric,value` format for readability in any text editor or spreadsheet.
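
A vertical report like this loads naturally into a dictionary. The sketch below assumes two comma-separated columns per row and treats any header row as just another entry; `read_summary` is a hypothetical helper, and the sample values are illustrative rather than real output.

```python
import csv
import io
from typing import Dict, TextIO


def read_summary(handle: TextIO) -> Dict[str, str]:
    """Collect metric -> value pairs from a two-column summary CSV."""
    metrics: Dict[str, str] = {}
    for row in csv.reader(handle):
        if len(row) >= 2:  # skip blank or malformed lines
            metrics[row[0]] = row[1]
    return metrics


# Illustrative content only; a real report contains more metrics.
example = io.StringIO(
    "sample,sample1\n"
    "total_reads,250000\n"
    "passed_reads,245169\n"
)
summary = read_summary(example)
pass_fraction = int(summary["passed_reads"]) / int(summary["total_reads"])
print(f"{pass_fraction:.2%} of pairs passed")
```

With a real file, pass an open handle instead: `read_summary(open("results/sample1.summary.csv", newline=""))`, where the path is whatever `<stem>.summary.csv` was written for that sample.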
Paired-end fields (R1 report):
| Metric | Description |
|---|---|
| `sample` | Sample name |
| `r1_file` | R1 input filename |
| `total_reads` | Total read pairs processed |
| `passed_reads` | Pairs where both mates passed all filters |
| `failed_reads` | Pairs that did not pass |
| `pct_pairs_passed` | Percentage of pairs passed |
| `r1_pass_rate` | Percentage of R1 reads passing individually |
| `lost_due_to_r1_fail` | Pairs lost because R1 failed (R2 was fine) |
| `failed_both` | Pairs where both mates failed |
| `r1_too_short` | R1 reads below minimum length |
| `r1_n` | R1 reads exceeding N threshold |
| `r1_dot` | R1 reads containing `.` characters |
| `r1_homopolymer` | R1 reads with a homopolymer run |
| `r1_low_score` | R1 reads below minimum quality score |
| `r1_len_min/max/mean/median` | Read length statistics |
| `min_length` | Length threshold used |
| `homopolymer_len` | Homopolymer threshold used |
| `min_score` | Quality threshold used |
| `max_n_allowed` | N threshold used |
| `elapsed_min` | Total wall-clock time (minutes) |

Exclusion reason counts are not mutually exclusive: a read failing multiple filters is counted in each applicable category.
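
The length statistics in the report can be derived from a length histogram instead of a stored list of reads. The sketch below shows one way to compute min, max, mean, and median from a `collections.Counter` keyed by read length; `length_stats` is a hypothetical helper and may not match how fastfilter computes these fields.

```python
from collections import Counter
from typing import Dict


def length_stats(lengths: Counter) -> Dict[str, float]:
    """Min, max, mean, and median from a length -> count histogram."""
    total = sum(lengths.values())
    if total == 0:
        raise ValueError("no reads recorded")
    mean = sum(length * n for length, n in lengths.items()) / total
    # Ranks (0-based) of the one or two middle values in sorted order.
    lo_rank, hi_rank = (total - 1) // 2, total // 2
    seen = 0
    lo = hi = None
    for length in sorted(lengths):
        seen += lengths[length]
        if lo is None and seen > lo_rank:
            lo = length
        if seen > hi_rank:
            hi = length
            break
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean,
        "median": (lo + hi) / 2,
    }


hist = Counter()
for read_length in (100, 100, 75, 50):  # one increment per read
    hist[read_length] += 1
stats = length_stats(hist)
print(stats)
```

The histogram holds one entry per distinct read length, so its size does not depend on how many reads are processed.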
All five filters are evaluated independently per read. A pair is written to output only when both mates pass all filters.
For each read:

1. `len(seq) >= min_length`
2. `mean_phred(qual) >= min_score`, where `mean_phred = mean(ord(c) - 33 for c in qual)`
3. `seq.count('N') <= max_n`
4. `'.' not in seq`
5. No homopolymer run of length `>= homopolymer_len` (A, T, G, or C)
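
The five checks above translate almost directly into Python. The following is a sketch under assumptions rather than the tool's own implementation: `failed_filters` and `pair_passes` are hypothetical names, the failure labels are borrowed from the summary metrics, qualities are assumed to be Phred+33, and the homopolymer test uses a regular expression that may differ from what fastfilter does internally.

```python
import re
from typing import List, Tuple


def mean_phred(qual: str) -> float:
    """Mean Phred score of a Phred+33 encoded quality string."""
    return sum(ord(c) - 33 for c in qual) / len(qual)


def failed_filters(seq: str, qual: str, min_length: int = 25,
                   min_score: float = 30, homopolymer_len: int = 25,
                   max_n: int = 0) -> List[str]:
    """Names of every filter the read fails; an empty list means it passes."""
    failures = []
    if len(seq) < min_length:
        failures.append("too_short")
    if not qual or mean_phred(qual) < min_score:
        failures.append("low_score")
    if seq.count("N") > max_n:
        failures.append("n")
    if "." in seq:
        failures.append("dot")
    # One base followed by at least (homopolymer_len - 1) repeats of itself.
    if re.search(r"([ACGT])\1{%d,}" % (homopolymer_len - 1), seq):
        failures.append("homopolymer")
    return failures


def pair_passes(r1: Tuple[str, str], r2: Tuple[str, str], **thresholds) -> bool:
    """A pair is kept only when both mates pass every filter."""
    return (not failed_filters(*r1, **thresholds)
            and not failed_filters(*r2, **thresholds))
```

Every check runs even after an earlier one fails, which is why a single read can contribute to several exclusion counts.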
| Factor | Detail |
|---|---|
| gzip backend | Install isal for 2–4× faster I/O on .fastq.gz files. Active backend shown at startup. |
| Parallelism | -j N processes N samples simultaneously. Workers are capped to the number of samples. |
| Memory | Length statistics are accumulated in a `Counter`, so memory use stays small and does not grow with the number of reads. No reads are held in RAM. |
| Progress | tqdm bar in interactive single-worker sessions; timestamped checkpoint lines every 5M reads otherwise. |
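
The worker cap described in the Parallelism row can be sketched with the standard library. This is a hedged illustration: `worker_count` and `run_samples` are hypothetical helpers, and fastfilter may use a different pool mechanism.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, str]  # (R1 path, R2 path)


def worker_count(cpus: int, n_samples: int) -> int:
    """Cap the pool at the number of samples, with at least one worker."""
    return max(1, min(cpus, n_samples))


def run_samples(process: Callable[[Sample], object],
                samples: Sequence[Sample], cpus: int) -> List[object]:
    """Process one sample per worker; results come back in input order."""
    workers = worker_count(cpus, len(samples))
    if workers == 1:
        # A single worker needs no pool and no inter-process overhead.
        return [process(sample) for sample in samples]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, samples))
```

With `-j 4` and two samples, `worker_count(4, 2)` yields two workers. When a pool is used, `process` must be a module-level function so it can be sent to the worker processes.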

Example output:

    fastfilter — 2026-04-15 11:40:32
    Output dir : results/
    Min length : 25
    Min score : 30
    Homopolymer : 25
    Max N : 0
    Compression : level 6 [isal (fast)]
    Mode : paired-end | 2 sample(s) | 2 CPU(s)
    [1] sample1_R1.fastq + sample1_R2.fastq
    [2] sample2_R1.fastq + sample2_R2.fastq
    [11:40:39] sample1_R1.fastq: Finished. 245169 / 250000 passed.
    [11:40:41] sample2_R1.fastq: Finished. 241083 / 250000 passed.
    [11:40:41] All done. Ran in 0.15 min.

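
Timestamped checkpoint lines of this kind are straightforward to produce from a generator wrapper. The sketch below is an assumption-laden illustration: `with_checkpoints` and `is_interactive` are hypothetical names, the message wording is invented, and checking `isatty()` is only one common way to decide between a progress bar and plain log lines.

```python
import io
import sys
from datetime import datetime
from typing import Iterable, Iterator, Optional, TextIO, TypeVar

T = TypeVar("T")


def is_interactive(stream: TextIO) -> bool:
    """True when the stream is a terminal rather than a file or pipe."""
    return stream.isatty()


def with_checkpoints(reads: Iterable[T], label: str,
                     every: int = 5_000_000,
                     out: Optional[TextIO] = None) -> Iterator[T]:
    """Pass reads through unchanged, logging a line every `every` reads."""
    out = out if out is not None else sys.stderr
    for n, read in enumerate(reads, start=1):
        if n % every == 0:
            stamp = datetime.now().strftime("%H:%M:%S")
            # Hypothetical wording; the real checkpoint text may differ.
            print(f"[{stamp}] {label}: {n} reads processed",
                  file=out, flush=True)
        yield read


# Small demonstration with a tiny interval and an in-memory log.
log = io.StringIO()
seen = list(with_checkpoints(range(5), "sample1_R1.fastq", every=2, out=log))
print(log.getvalue(), end="")
```

Flushing each line matters when output is redirected to a file such as `nohup.out`, since buffered output would otherwise appear only at exit.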

Project structure:

    fastfilter/
    ├── fastfilter.py      # Main script
    ├── requirements.txt   # Python dependencies
    ├── CHANGELOG.md       # Version history
    ├── LICENSE            # MIT License
    └── README.md          # This file

If you use fastfilter in your research, please acknowledge:
Monteiro, L. (2026). fastfilter: High-performance FASTQ quality filter for RNA-seq data (v2.0). RNA Systems Biology Lab, BioISI, Faculty of Sciences, University of Lisbon.
MIT License — see LICENSE for details.
Lucas Monteiro
RNA Systems Biology Lab
BioISI — Biosystems and Integrative Sciences Institute
Department of Chemistry and Biochemistry
Faculty of Sciences, University of Lisbon
✉ ldmonteiro@fc.ul.pt