fastder is a C++-based tool for detecting expressed regions in RNA-seq data.
It is intended to build on the recount3 resource, which consists of over 750'000 uniformly processed RNA-seq samples across different mouse and human studies.
The tool aims to reconstruct expressed genes prior to splicing in an annotation-agnostic approach.
fastder takes genome-wide coverage files and splice junction coordinates as input. Coverage can be supplied as BedGraph by default, or as BigWig when the build was configured with -DFASTDER_USE_LIBBIGWIG=ON. The tool averages across samples and applies a coverage threshold to identify consecutive regions with above-threshold expression. It then stitches expressed regions (ERs) together when a splice junction in the input matches the end of one ER and the start of the next.
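The thresholding step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not fastder's actual code: the names `CoverageInterval`, `ExpressedRegion` and `find_expressed_regions` are hypothetical, the input is assumed to be already averaged across samples, sorted by start and non-overlapping, and coordinates are assumed to be half-open as in BedGraph.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; fastder's own classes may differ.
struct CoverageInterval {
    std::uint64_t start;  // 0-based, inclusive
    std::uint64_t end;    // exclusive
    double coverage;      // assumed: mean coverage across samples
};

struct ExpressedRegion {
    std::uint64_t start;
    std::uint64_t end;
    double mean_coverage;  // length-weighted mean over the region
};

// Merge directly adjacent intervals whose coverage is >= min_coverage into
// expressed regions and drop regions shorter than min_length.
std::vector<ExpressedRegion> find_expressed_regions(
    const std::vector<CoverageInterval>& intervals,
    double min_coverage, std::uint64_t min_length) {
    std::vector<ExpressedRegion> regions;
    bool open = false;
    std::uint64_t start = 0, end = 0;
    double weighted_sum = 0.0;

    // Close the currently open region, keeping it only if it is long enough.
    auto flush = [&]() {
        if (open && end - start >= min_length) {
            regions.push_back(
                {start, end, weighted_sum / static_cast<double>(end - start)});
        }
        open = false;
        weighted_sum = 0.0;
    };

    for (const auto& iv : intervals) {
        const bool expressed = iv.coverage >= min_coverage;
        // Assumption: a gap between intervals means no coverage, so it
        // ends the region just like a below-threshold interval does.
        if (open && (!expressed || iv.start != end)) flush();
        if (!expressed) continue;
        if (!open) {
            open = true;
            start = iv.start;
        }
        end = iv.end;
        weighted_sum += iv.coverage * static_cast<double>(iv.end - iv.start);
    }
    flush();
    return regions;
}
```

Called with a threshold of 0.25 and a minimum length of 5, a 3 bp stretch of high coverage would be discarded while two adjacent above-threshold intervals would be merged into a single region.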
Coverage is held in memory as a sparse list of intervals rather than a dense per-base vector. On chr21 this keeps the resident set in the low hundreds of MB instead of multiple GB at full hg38. Splice junctions are partitioned by strand at the Integrator: each chromosome's expressed regions are walked once per strand, and chains built from junction-linked ERs are tagged + or -. Standalone ERs that no junction connected to a neighbour stay unstranded (.).
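Partitioning junctions by strand might look something like the sketch below. The `SpliceJunction` record, the `StrandPartition` struct and the `partition_by_strand` function are hypothetical names chosen for illustration, and skipping junctions without a usable strand is an assumption of this sketch rather than documented fastder behaviour.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical junction record; field names are illustrative only.
struct SpliceJunction {
    std::string chrom;
    std::uint64_t start;
    std::uint64_t end;
    char strand;  // '+', '-', or any other character for unknown
};

struct StrandPartition {
    std::vector<SpliceJunction> plus;
    std::vector<SpliceJunction> minus;
};

// Split junctions into one list per strand so that expressed regions can be
// walked once against the '+' junctions and once against the '-' junctions.
StrandPartition partition_by_strand(const std::vector<SpliceJunction>& junctions) {
    StrandPartition out;
    for (const auto& sj : junctions) {
        if (sj.strand == '+') {
            out.plus.push_back(sj);
        } else if (sj.strand == '-') {
            out.minus.push_back(sj);
        }
        // Assumption: junctions with any other strand value are skipped.
    }
    return out;
}
```

Chains built while walking `plus` would then be tagged +, chains built while walking `minus` would be tagged -, and any ER that ends up in neither stays unstranded.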
The default build needs cmake (4.0 or newer) and a C++20 compiler:
mkdir build
cd build
cmake ..
make -j
To read BigWig coverage directly instead of converting to BedGraph first,
configure with -DFASTDER_USE_LIBBIGWIG=ON. CMake will fetch libBigWig from
GitHub at configure time. zlib and libcurl headers must be available.
cmake -DFASTDER_USE_LIBBIGWIG=ON ..
The unit tests run with ctest from the build directory. Two tests are gated
on the libBigWig option and are skipped in the default build.
recount3 provides uniformly processed RNA-seq data for over 8'000 human and over 10'000 mouse studies. Each study consists of multiple per-sample
coverage bigWig files and one set of per-study splice junction coordinate files, amongst others.
Existing input files can be downloaded from the recount3 online platform.
Users can thus either provide data from one of the existing studies or, for new RNA-seq data, run the recount3 pipeline, which is the easiest way to obtain the required input files.
fastder builds on the Monorail pipeline used by recount3. Monorail takes the FASTQ files provided by Illumina Sequencing as an input.
A brief summary of the relevant steps in the Monorail pipeline (used to create recount3 resources) is provided below:
- Input data:
  - unpaired or paired-end FASTQ files
  - suffix-array-based index of the reference genome sequence
- Perform spliced alignment with STAR to obtain
  - a BAM file with the spliced alignment
  - a summary of detected splice junctions
- Use Megadepth to produce bigWig coverage files
- Aggregate SJ.out.tab into a
  - MM file
  - RR file
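For readers who want to inspect the aggregated junction counts, the sketch below reads a sparse matrix in Matrix Market coordinate format. It assumes that the MM file is a decompressed Matrix Market file, which should be checked against the recount3 documentation. The names `MatrixEntry`, `SparseMatrix` and `read_matrix_market` are hypothetical and do not refer to fastder's Parser.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One non-zero entry; indices are kept 1-based as stored in the file.
struct MatrixEntry {
    std::uint64_t row;
    std::uint64_t col;
    double value;
};

struct SparseMatrix {
    std::uint64_t rows = 0;
    std::uint64_t cols = 0;
    std::vector<MatrixEntry> entries;
};

// Read a Matrix Market coordinate file: lines starting with '%' are
// header or comment lines, the first remaining line is "rows cols nnz",
// and every following line is "row col value".
// Returns false if a line cannot be parsed or the size line is missing.
bool read_matrix_market(std::istream& in, SparseMatrix& m) {
    std::string line;
    bool have_size = false;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '%') continue;
        std::istringstream fields(line);
        if (!have_size) {
            std::uint64_t nnz = 0;
            if (!(fields >> m.rows >> m.cols >> nnz)) return false;
            m.entries.reserve(nnz);
            have_size = true;
        } else {
            MatrixEntry e{};
            if (!(fields >> e.row >> e.col >> e.value)) return false;
            m.entries.push_back(e);
        }
    }
    return have_size;
}
```

In this reading, each row would correspond to one junction and each column to one sample, but that mapping is an assumption of the sketch.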
The following diagram provides an overview of the tables and objects used in fastder. The _File suffix indicates that the table is one of the input files.
All other tables are objects created by the Parser class to map between the three different sample IDs (in lilac) used by the splice junction and coverage files respectively.
The following sequence diagram provides an abstracted overview of the three main functional stages of fastder.
fastder can currently take only one RR and MM file as an input. Thus, users directly working with
recount3 resources can only provide samples from the same study as an input.
fastder expects all input files to be in the same folder, given with --dir as a path relative to the build directory.

fastder optionally lets users restrict the analysis to specific chromosomes. For example, --chr chr1 makes the tool output only expressed regions on chromosome 1 and ignore all coverage and splice junction information from other chromosomes.

fastder also accepts four optional thresholds:
- --min-coverage 0.25 sets the coverage threshold of an expressed region (ER). A base-pair position must have at least 0.25 CPM coverage to be added to an ER.
- --min-length 5 sets the minimum length (in bp) that an ER must have. For instance, three consecutive base pairs with coverage above 0.25 CPM are ignored if the minimum length is set to 5 bp.
- --position-tolerance 5 sets the maximum permitted offset between the end position of an ER and the start position of a splice junction. With a tolerance of 5, an ER ending at 1000 bp and a splice junction starting at 1005 bp are stitched together (provided the coverage and the junction end also match).
- --coverage-tolerance 0.1 sets the maximum permitted relative coverage deviation between two ERs that are separated by a spliced region. With a tolerance of 0.1, two ERs with coverage of 10 CPM and 11 CPM are stitched together (provided there is a matching splice junction).
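The interplay of the position and coverage tolerances can be sketched as a single predicate. This is an illustrative sketch only: `Region`, `Junction`, `within` and `can_stitch` are hypothetical names, and measuring the relative coverage deviation against the smaller of the two coverages is an assumption chosen so that 10 CPM and 11 CPM pass at a tolerance of 0.1. fastder may define the deviation differently.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical minimal records for illustration.
struct Region {
    std::uint64_t start;
    std::uint64_t end;
    double coverage;  // mean coverage in CPM
};

struct Junction {
    std::uint64_t start;
    std::uint64_t end;
};

// True if two positions differ by at most `tolerance` base pairs.
bool within(std::uint64_t a, std::uint64_t b, std::uint64_t tolerance) {
    return (a > b ? a - b : b - a) <= tolerance;
}

// Decide whether `left` and `right` may be stitched across junction `sj`.
bool can_stitch(const Region& left, const Region& right, const Junction& sj,
                std::uint64_t position_tolerance, double coverage_tolerance) {
    // The junction must start near the end of the left ER ...
    if (!within(left.end, sj.start, position_tolerance)) return false;
    // ... and end near the start of the right ER.
    if (!within(right.start, sj.end, position_tolerance)) return false;

    // Assumption: relative deviation is measured against the smaller coverage.
    const double reference = std::min(left.coverage, right.coverage);
    if (reference <= 0.0) return false;
    const double deviation = std::fabs(left.coverage - right.coverage) / reference;
    return deviation <= coverage_tolerance;
}
```

With a position tolerance of 5 and a coverage tolerance of 0.1, an ER ending at 1000 bp with 10 CPM and a junction starting at 1005 bp pass the first check, and a downstream ER with 11 CPM passes the coverage check.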
A visualization of the different parameters is provided below.
Usage:
fastder \
--dir <path> ... \
[--chr <chr1> <chr2> ...] \
[--min-length <float>] \
[--min-coverage <float>] \
[--position-tolerance <int>] \
[--coverage-tolerance <float>] \
[--help]
Required inputs:
--dir <path> ... Relative path from the build directory to the directory containing the input files.
Example: --dir ../../data/test_exon_skipping
Optional inputs:
--chr <chr1> <chr2> ... List of chromosomes to process.
Default: all (chr1-chr22, chrX)
Example: --chr chr1 chr2 chr3
--min-length <float> Minimum length [#bp] required for a region to qualify as an expressed region (ER).
Default: 5 bp
Example: --min-length 5
--min-coverage <float> Minimum coverage [CPM] required for a region to qualify as an ER.
Normalized in-place by library size.
Default: 0.25 CPM
Example: --min-coverage 0.25
--position-tolerance <int> Maximum allowed positional deviation between splice junction and ER coordinates [bp].
Default: 5 bp
Example: --position-tolerance 5
--coverage-tolerance <float> Allowed relative deviation in coverage between stitched ERs (e.g. 0.1 = 10%).
Default: 0.1
Example: --coverage-tolerance 0.1
--help Show this help message.
Example:
fastder \
--dir ../../data/input \
--chr chr1 chr2 \
--position-tolerance 5 \
--min-length 5 \
--min-coverage 0.25 \
--coverage-tolerance 0.1
GPLv3




