Is there an existing issue for this?
Bug summary
In grounding.py, ground_term() (L.113-121) computes the semantic similarity between the query and every candidate concept on demand whenever omop-emb is available, regardless of which resolver produced the candidate.
It would probably be wiser to compute semantic similarity only for concepts returned by the EmbeddingResolver rather than for all candidates, so that the strong signal from the label-based resolvers is not diluted.
A specific example of this dilution occurs with the FullTextResolver: it surfaces hundreds of near-duplicate concepts sharing a stem ("Malignant neoplasm of kidney, NOS" / "...except renal pelvis" / etc.) whose embeddings are nearly indistinguishable from each other relative to the query.
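A minimal sketch of the proposed gating, to make the suggestion concrete. It is not the actual ground_term() code: the Candidate class, its source field, the score_candidates() helper, and the similarity callable are all hypothetical names introduced for illustration, and the real candidate objects may track their originating resolver differently.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass
class Candidate:
    """Hypothetical stand-in for a candidate concept returned by a resolver."""
    concept_id: int
    label: str
    source: str   # assumed: name of the resolver that produced this candidate
    score: float  # score assigned by that resolver
    semantic_similarity: Optional[float] = None


def score_candidates(
    query: str,
    candidates: List[Candidate],
    similarity: Callable[[str, str], float],
    embedding_sources: FrozenSet[str] = frozenset({"EmbeddingResolver"}),
) -> List[Candidate]:
    """Compute semantic similarity only for embedding-derived candidates.

    Candidates from label-based resolvers keep their original score and
    are left with semantic_similarity = None.
    """
    for candidate in candidates:
        if candidate.source in embedding_sources:
            candidate.semantic_similarity = similarity(query, candidate.label)
    return candidates
```

With this gating, the near-duplicate concepts surfaced by the FullTextResolver would never receive an embedding score, so their nearly identical similarities could not affect the final ranking.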
Code for reproduction
Run ground_term() with a pipeline composed of several resolvers while omop-emb is active.
Error messages
None. The call succeeds but may return incorrect scores.