Sequence conservation as a first-class feature (df_seq / df_parts input) #64

@breimanntools

Problem

comp_seq_cons currently requires a pre-existing MSA file on disk and returns a position-indexed DataFrame independent of AAanalysis' core data structures. Users who already have a df_seq or df_parts cannot compute conservation without manually exporting sequences, generating an MSA, and round-tripping through files.

Goal

Expose sequence conservation as a first-class operation on AAanalysis data: accept df_seq or df_parts as input, compute per-position conservation, and return results in a format compatible with the existing CPP / feature-map pipeline (so conservation can be used as a feature, an annotation, or a filter alongside physicochemical features).
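One possible shape for the "compatible format" is a long table with one row per entry and position, labeled with the part it falls into. The sketch below is only an illustration under stated assumptions: `annotate_positions` is a hypothetical helper, positions are assumed to be 1-based, `tmd_start`/`tmd_stop` are assumed to come from the `df_seq` row, and the JMD lengths default to 10 as an assumption. It uses plain dicts so it runs without dependencies; the rows could be passed to `pandas.DataFrame` as-is.

```python
def annotate_positions(entry, scores, tmd_start, tmd_stop, jmd_n_len=10, jmd_c_len=10):
    """Label per-position scores with the part (jmd_n, tmd, jmd_c) they belong to.

    Hypothetical helper, not part of AAanalysis. Positions are assumed 1-based
    and inclusive; positions outside all three parts are dropped.
    """
    rows = []
    for pos, score in enumerate(scores, start=1):
        if tmd_start <= pos <= tmd_stop:
            part = "tmd"
        elif tmd_start - jmd_n_len <= pos < tmd_start:
            part = "jmd_n"
        elif tmd_stop < pos <= tmd_stop + jmd_c_len:
            part = "jmd_c"
        else:
            continue  # outside the region covered by the parts
        rows.append({"entry": entry, "position": pos, "part": part, "conservation": score})
    return rows


# Toy example: 30 positions, TMD at 11..20, JMDs of length 5 on each side
rows = annotate_positions("P1", [0.5] * 30, tmd_start=11, tmd_stop=20,
                          jmd_n_len=5, jmd_c_len=5)
parts_in_order = [row["part"] for row in rows]
```

Keeping the table long (rather than one column per position) would make it straightforward to aggregate per part or to join with other per-position annotations later; whether that matches what plot_feature_map needs is left open here.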

Tasks

  • Extend comp_seq_cons to accept df_seq=... or df_parts=... as keyword inputs that internally trigger MSA generation per entry (see related issue on Biopython interface)
  • Define output schema: per-entry, per-position conservation aligned to the existing position numbering used in CPP (tmd_start in df_seq; jmd_n, tmd, jmd_c parts)
  • Support multiple conservation metrics (Shannon entropy, Jensen-Shannon divergence, von Neumann entropy, position-specific scoring) via a metric parameter
  • Allow conservation to be merged into a df_feat-compatible output so it can be plotted with plot_feature_map
  • Handle gaps (-) consistently with the rest of the AAanalysis API (accept_gaps=True)
  • Add caching for MSA-derived conservation scores (avoid re-running expensive alignments)
  • Add tests with a small synthetic alignment plus one real example (e.g., DOM_GSEC benchmark)
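To make the metric and gap-handling tasks concrete, here is a minimal sketch of two column-wise scores on a small synthetic alignment. Everything in it is an assumption for illustration: the function names, the `-` gap symbol, the 20-letter alphabet, the normalization of Shannon entropy by log2(20), and the uniform background used for the Jensen-Shannon divergence. It is not the existing comp_seq_cons implementation.

```python
import math
from collections import Counter

GAP = "-"                              # assumed gap symbol
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed alphabet (20 canonical residues)


def _entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def _frequencies(column, accept_gaps):
    """Residue frequencies over AMINO_ACIDS for one column, ignoring gaps."""
    if GAP in column and not accept_gaps:
        raise ValueError("Column contains gaps; set accept_gaps=True to ignore them.")
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return None  # gap-only column: no score can be computed
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]


def shannon_conservation(freqs):
    """1 - normalized entropy: 1.0 for a fully conserved column."""
    return 1.0 - _entropy(freqs) / math.log2(len(freqs))


def jsd_conservation(freqs):
    """Jensen-Shannon divergence (bits) from a uniform background distribution."""
    background = [1.0 / len(freqs)] * len(freqs)
    mixture = [(p + q) / 2 for p, q in zip(freqs, background)]
    return _entropy(mixture) - 0.5 * _entropy(freqs) - 0.5 * _entropy(background)


METRICS = {"shannon": shannon_conservation, "jsd": jsd_conservation}


def conservation_per_position(aligned_seqs, metric="shannon", accept_gaps=True):
    """Score every alignment column; gap-only columns yield NaN."""
    if metric not in METRICS:
        raise ValueError(f"Unknown metric {metric!r}; choose from {sorted(METRICS)}")
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("All aligned sequences must have the same length.")
    scores = []
    for column in zip(*aligned_seqs):
        freqs = _frequencies(column, accept_gaps)
        scores.append(float("nan") if freqs is None else METRICS[metric](freqs))
    return scores


# Small synthetic alignment: conserved, mostly conserved, gapped, variable
msa = ["ACDE", "ACDF", "AC-G", "ASDH"]
shannon_scores = conservation_per_position(msa, metric="shannon")
jsd_scores = conservation_per_position(msa, metric="jsd")
```

Ignoring gaps when computing frequencies is only one option; an alternative would be to penalize gapped columns, and that trade-off is the kind of thing the metric documentation would need to spell out.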

How this improves AAanalysis

  • Closes the gap between sequence-based and feature-based interpretation: conservation becomes another "scale" in the CPP framework rather than a separate workflow
  • Enables direct comparison between physicochemical importance (CPP) and evolutionary importance (conservation) on the same plot
  • Removes a manual file-handling step that breaks the otherwise-clean Python API

Acceptance criteria

  • aa.comp_seq_cons(df_seq=df_seq) returns a DataFrame compatible with plot_feature_map
  • At least two conservation metrics are supported with documented trade-offs
  • End-to-end tutorial showing conservation overlaid on a CPP feature map
  • Existing file-based comp_seq_cons(msa_file=...) API continues to work (no breaking change)
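The no-breaking-change criterion could be met by keeping msa_file as a keyword and requiring exactly one input source per call. The following is a rough sketch of that argument check only; `resolve_input_mode` is a hypothetical helper name, and the real function would go on to load the MSA or build one from the DataFrame.

```python
def resolve_input_mode(msa_file=None, df_seq=None, df_parts=None):
    """Return the name of the single input that was provided.

    Hypothetical helper: the existing msa_file path stays valid, while df_seq
    and df_parts are added as alternative, mutually exclusive inputs.
    """
    candidates = {"msa_file": msa_file, "df_seq": df_seq, "df_parts": df_parts}
    given = [name for name, value in candidates.items() if value is not None]
    if len(given) != 1:
        raise ValueError(
            "Provide exactly one of 'msa_file', 'df_seq', or 'df_parts' "
            f"(got: {given or 'none'})."
        )
    return given[0]


# Existing file-based calls keep resolving to the file-based path
mode = resolve_input_mode(msa_file="alignment.fasta")
```

Comparing against `None` rather than truthiness matters here, because evaluating a pandas DataFrame as a boolean raises an error.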
