Semantic similarity is also being used for non-embedding resolvers #23

Description

@nicoloesch

Is there an existing issue for this?

  • I have searched the existing issues

Bug summary

In grounding.py, ground_term() (L.113-121) computes the semantic similarity between the query and every candidate concept on demand whenever omop-emb is available, regardless of which resolver produced the candidate.
It would probably be wiser to compute semantic similarity only for concepts that come from the EmbeddingResolver rather than for all candidates, so that the strong signal from the label-based resolvers is not diluted.
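
A minimal sketch of the proposed gating, not the actual grounding.py code. The `Candidate` class, its `resolver` field, the `add_similarity` helper, and the `similarity_fn` callback are all hypothetical names used for illustration; the real candidate objects and the omop-emb call may look different.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Candidate:
    label: str
    resolver: str  # hypothetical: name of the resolver that produced this candidate
    similarity: Optional[float] = None  # stays None unless computed below


def add_similarity(
    query: str,
    candidates: Iterable[Candidate],
    similarity_fn: Callable[[str, str], float],  # stand-in for the omop-emb lookup
    embedding_resolvers: frozenset = frozenset({"EmbeddingResolver"}),
) -> List[Candidate]:
    """Compute semantic similarity only for embedding-derived candidates."""
    scored = []
    for cand in candidates:
        # Label-based candidates keep similarity=None, so downstream ranking
        # relies on their resolver signal alone.
        if cand.resolver in embedding_resolvers:
            cand.similarity = similarity_fn(query, cand.label)
        scored.append(cand)
    return scored
```

With this shape, a candidate from the FullTextResolver would pass through untouched, while a candidate from the EmbeddingResolver would get its similarity filled in. How a missing similarity should be handled in the final ranking is left open here.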

A specific example where embedding dilution happens is the FullTextResolver: it surfaces hundreds of near-duplicate concepts sharing a stem ("Malignant neoplasm of kidney, NOS" / "...except renal pelvis" / etc.) whose embeddings are nearly indistinguishable from each other relative to the query.
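
To illustrate why near-duplicates give the similarity score little to work with, here is a toy sketch using cosine similarity. The vectors are made up for illustration and are not real omop-emb embeddings; the assumption is only that near-duplicate labels map to vectors that differ by small perturbations.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Synthetic query vector and four synthetic "near-duplicate" concept vectors.
query = [1.0, 0.2, 0.0]
near_duplicates = [
    [0.90, 0.30, 0.10],
    [0.91, 0.30, 0.10],
    [0.90, 0.31, 0.10],
    [0.90, 0.30, 0.11],
]

similarities = [cosine(query, vec) for vec in near_duplicates]
# The spread between best and worst candidate is tiny compared to the scores
# themselves, so the ranking among them is close to arbitrary.
spread = max(similarities) - min(similarities)
print(similarities, spread)
```

When hundreds of such candidates enter the ranking with almost identical similarity scores, the semantic signal cannot separate them and may instead blur the ordering that the label-based resolver already provided.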

Code for reproduction

Run ground_term() with a pipeline composed of several resolvers while omop-emb is active.

Error messages

None; the candidates just receive potentially wrong scores.

Labels: bug
