Problem
comp_seq_cons currently requires a pre-existing MSA file on disk and returns a position-indexed DataFrame independent of AAanalysis' core data structures. Users who already have a df_seq or df_parts cannot compute conservation without manually exporting sequences, generating an MSA, and round-tripping through files.
Goal
Expose sequence conservation as a first-class operation on AAanalysis data: accept df_seq or df_parts as input, compute per-position conservation, and return results in a format compatible with the existing CPP / feature-map pipeline (so conservation can be used as a feature, an annotation, or a filter alongside physicochemical features).
Tasks
- Add comp_seq_cons(df_seq=...) and comp_seq_cons(df_parts=...) overloads that internally trigger MSA generation per entry (see related issue on Biopython interface)
- Define output schema: per-entry, per-position conservation aligned to the existing position numbering used in CPP (tmd_start, jmd_n, tmd, jmd_c parts)
- Support multiple conservation metrics (Shannon entropy, Jensen-Shannon divergence, von Neumann entropy, position-specific scoring) via a metric parameter
- Allow conservation to be merged into a df_feat-compatible output so it can be plotted with plot_feature_map
- Handle gaps (_) consistently with the rest of the AAanalysis API (accept_gaps=True)
- Add caching for MSA-derived conservation scores (avoid re-running expensive alignments)
- Add tests with a small synthetic alignment plus one real example (e.g., DOM_GSEC benchmark)
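
To make the metric parameter and gap handling concrete, here is a minimal sketch of a normalized Shannon-entropy score computed per alignment column. It is an illustration under stated assumptions, not the proposed implementation: the helper names (shannon_conservation, conservation_per_position), the gap argument and its "-" default, and the normalization by log2(20) are all hypothetical choices, and the alignment is assumed to be already available as equal-length strings.

```python
import math
from collections import Counter


def shannon_conservation(column, gap="-", accept_gaps=True, n_states=20):
    """Score one alignment column as 1 - H / log2(n_states).

    Hypothetical helper: 1.0 means fully conserved, lower means more variable.
    Gaps are excluded from the residue counts (assumed behaviour).
    """
    residues = [aa for aa in column if aa != gap]
    if len(residues) < len(column) and not accept_gaps:
        raise ValueError("Column contains gaps but accept_gaps=False")
    if not residues:
        return float("nan")  # gap-only column: no residues to score
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - entropy / math.log2(n_states)


# Registry of metrics; further metrics (e.g. Jensen-Shannon divergence)
# would be added here under their own keys.
METRICS = {"shannon": shannon_conservation}


def conservation_per_position(aligned_seqs, metric="shannon", gap="-", accept_gaps=True):
    """Return one conservation score per alignment column (hypothetical helper)."""
    if metric not in METRICS:
        raise ValueError(f"Unknown metric {metric!r}; choose from {sorted(METRICS)}")
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("All aligned sequences must have the same length")
    score = METRICS[metric]
    # zip(*aligned_seqs) iterates over the alignment column by column.
    return [score(col, gap=gap, accept_gaps=accept_gaps) for col in zip(*aligned_seqs)]


scores = conservation_per_position(["ACD", "ACE", "A-D"])
```

Dispatching through a dictionary keeps the metric argument a plain string, so adding a second metric would not change the call signature.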
How this improves AAanalysis
- Closes the gap between sequence-based and feature-based interpretation: conservation becomes another "scale" in the CPP framework rather than a separate workflow
- Enables direct comparison between physicochemical importance (CPP) and evolutionary importance (conservation) on the same plot
- Removes a manual file-handling step that breaks the otherwise-clean Python API
Acceptance criteria
- aa.comp_seq_cons(df_seq=df_seq) returns a DataFrame compatible with plot_feature_map
- At least two conservation metrics are supported with documented trade-offs
- End-to-end tutorial showing conservation overlaid on a CPP feature map
- Existing file-based comp_seq_cons(msa_file=...) API continues to work (no breaking change)
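
One possible way to satisfy the no-breaking-change criterion is to keep msa_file as an accepted keyword and require that exactly one input source is given. The sketch below only shows that input check; resolve_cons_input is a hypothetical helper name, and the parameter names are taken from this issue rather than from the current function signature.

```python
def resolve_cons_input(msa_file=None, df_seq=None, df_parts=None):
    """Return (kind, value) for the single input source that was provided.

    Hypothetical helper: raises ValueError unless exactly one of the three
    arguments is given, so existing msa_file=... calls keep working.
    """
    candidates = (("msa_file", msa_file), ("df_seq", df_seq), ("df_parts", df_parts))
    given = [(kind, value) for kind, value in candidates if value is not None]
    if len(given) != 1:
        names = ", ".join(kind for kind, _ in candidates)
        raise ValueError(f"Provide exactly one of: {names} (got {len(given)})")
    return given[0]


# Existing file-based usage resolves to the "msa_file" branch.
kind, value = resolve_cons_input(msa_file="alignment.fasta")
```

Comparing against None rather than relying on truthiness matters here, because evaluating a pandas DataFrame in a boolean context raises an error.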