Specification about DMR.csv #20

@Yijun-Tian

Hi @hanyangii,
Could you share more details on how DMR.csv should be prepared? I tried to generate fine-tuning data from a BAM file with methylation called by dorado, but the preprocessing step fails:

 methylbert preprocess_finetune --methylcaller dorado --input_file  with_cell_type.labeled.bam --f_dmr $BED --f_ref $REF --split_ratio 0.8 --n_cores 23 -o methylbert.test/
MethylBERT v2.0.2
Could not find any statistics to sort DMRs
Number of DMRs to extract sequence reads: 220
Collecting reads from .bam files: 100%|██████████| 1/1 [00:01<00:00,  1.53s/it]
Fine-tuning data generated:
                                   name flag ref_name   ref_pos map_quality cigar  ...                   CT                                            dna_seq                                         methyl_seq dmr_ctype dmr_label ctype
0  23afdc5b-12f3-4e16-8ba1-bb8c42a21a51    0     chr6  50851183          60  713M  ...  prostate_epithelial  AAA AAC ACG CGT GTT TTT TTC TCA CAA AAG AGG GG...  2202222222222222222222222222222222222220222222...         T       170    NA

[1 rows x 46 columns]
Total sequences per cell type
ctype
NA    1
Name: count, dtype: int64
Traceback (most recent call last):
  File "/home/4470655/.conda/envs/methylbert/bin/methylbert", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/cli.py", line 313, in main
    run_preprocess(args)
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/cli.py", line 238, in run_preprocess
    finetune_data_generate(f_dmr=args.f_dmr,
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/data/finetune_data_generate.py", line 384, in finetune_data_generate
    train_files, test_files = train_test_split(
                              ^^^^^^^^^^^^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2780, in train_test_split
    n_train, n_test = _validate_shuffle_split(
                      ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2410, in _validate_shuffle_split
    raise ValueError(
ValueError: With n_samples=1, test_size=0.19999999999999996 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
Collecting reads from .bam files: 100%|██████████| 1/1 [00:01<00:00,  1.66s/it]
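
For reference, the ValueError itself seems to follow directly from only one read being extracted. The sketch below mirrors, in plain Python, the size arithmetic that scikit-learn's train_test_split applies when test_size is a float and train_size is None: the test count is rounded up and the train set gets the remainder. split_sizes is a hypothetical helper, not part of MethylBERT or scikit-learn, and I am assuming MethylBERT passes test_size = 1 - split_ratio, which would match the 0.19999999999999996 in the message.

```python
import math


def split_sizes(n_samples: int, split_ratio: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a float test_size, rounding the test count up."""
    # Assumption: the CLI derives test_size from --split_ratio like this.
    test_size = 1 - split_ratio
    # The test count is rounded up, so a single sample always lands in the test set.
    n_test = math.ceil(test_size * n_samples)
    # The train set receives whatever is left over.
    n_train = n_samples - n_test
    return n_train, n_test


# One extracted read, as in the log above: the train set ends up empty.
print(split_sizes(1, 0.8))   # (0, 1)
# With more reads the same ratio leaves a non-empty train set.
print(split_sizes(10, 0.8))  # (8, 2)
```

So the split error looks like a symptom: the underlying question is why only a single read was collected across the 220 DMRs.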

Here is my $BED. The DMRs come from a paired tumor-normal comparison. Does ctype mean cell type, as in your example?

head $BED
chr     start   end     ctype
chr1    828727  829648  T
chr1    19361106        19361591        T
chr1    37734952        37735434        T
chr1    74543166        74544241        T
chr1    87151545        87152349        T
chr1    91736103        91737407        T
chr1    106081022       106081218       T
chr1    121019929       121021641       T
chr1    156845668       156845999       T
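
In case it helps with diagnosis, here is a minimal sketch of how the table can be summarized with the Python standard library. It assumes the file is tab-separated with the header shown above; summarize_dmrs is a hypothetical helper of mine, not a MethylBERT function, and the first three rows are inlined so the snippet runs on its own.

```python
import csv
import io
from collections import Counter


def summarize_dmrs(handle):
    """Count DMRs per ctype and collect region lengths from a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    per_ctype = Counter()
    lengths = []
    for row in reader:
        per_ctype[row["ctype"]] += 1
        lengths.append(int(row["end"]) - int(row["start"]))
    return per_ctype, lengths


# First three rows of the table above (assumed tab-separated).
sample = (
    "chr\tstart\tend\tctype\n"
    "chr1\t828727\t829648\tT\n"
    "chr1\t19361106\t19361591\tT\n"
    "chr1\t37734952\t37735434\tT\n"
)

per_ctype, lengths = summarize_dmrs(io.StringIO(sample))
print(dict(per_ctype))  # {'T': 3}
print(lengths)          # [921, 485, 482]
```

In the rows shown above, T is the only value in the ctype column.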

Thanks!
