feat: Added IBM Db2 vector store integration #3518

Open
priyanshu-krishnan1 wants to merge 7 commits into deepset-ai:main from GeetikaChughIBM:ibm-db2-vectorstore

Conversation

@priyanshu-krishnan1

Related Issues

  • Adds IBM Db2 vector store integration for Haystack

Proposed Changes:

Added a new integration for IBM Db2 database with vector search capabilities:

  • Db2DocumentStore: Document store with vector similarity search using DB2's native VECTOR type
  • Db2EmbeddingRetriever: Retriever component for semantic search with metadata filtering
  • FilterTranslator: Converts Haystack filters to DB2 SQL with support for complex logical operators

How did you test it?

  • Unit tests for document store operations, filter translation, and connection handling
  • Integration tests using Docker Compose with IBM Db2 Community Edition
  • Haystack document store mixin tests
  • Manual verification with local Db2 instance

Notes for the reviewer

  • Follows standard Haystack document store patterns (similar to pgvector, oracle)

Checklist

@priyanshu-krishnan1 priyanshu-krishnan1 requested a review from a team as a code owner July 1, 2026 14:31
@priyanshu-krishnan1 priyanshu-krishnan1 requested review from bogdankostic and removed request for a team July 1, 2026 14:31
@socket-security

socket-security Bot commented Jul 1, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License
Added | pypi/ibm-db@3.2.9 | 84 | 100 | 100 | 100 | 100

@github-actions github-actions Bot added the topic:CI and type:documentation labels Jul 1, 2026
@sjrl sjrl requested review from sjrl and removed request for bogdankostic July 1, 2026 15:37
@sjrl

sjrl commented Jul 1, 2026

Contributor

Hey @bogdankostic I was already reviewing this in #3458 (comment) so I'll take this over

@sjrl sjrl self-assigned this Jul 1, 2026
Comment on lines +2 to +3
- modules:
- haystack_integrations.document_stores.ibm_db.document_store

Contributor

Missing the embedding retriever

Suggested change
Before:
- modules:
- haystack_integrations.document_stores.ibm_db.document_store
After:
- modules:
- haystack_integrations.components.retrievers.ibm_db.embedding_retriever
- haystack_integrations.document_stores.ibm_db.document_store

Comment thread integrations/ibm_db/pyproject.toml Outdated
all = 'pytest {args:tests}'
unit-cov-retry = 'pytest --cov=haystack_integrations --reruns 3 --reruns-delay 30 -x -m "not integration" {args:tests}'
integration-cov-append-retry = 'pytest --cov=haystack_integrations --cov-append --reruns 3 --reruns-delay 30 -x -m "integration" {args:tests}'
types = "mypy -p haystack_integrations.document_stores.ibm_db {args}"

Contributor

missing type checking on the retriever

Suggested change
Before:
types = "mypy -p haystack_integrations.document_stores.ibm_db {args}"
After:
types = "mypy -p haystack_integrations.document_stores.ibm_db -p haystack_integrations.components.retrievers.ibm_db {args}"

# If it still fails, raise the error
raise

def _validate_embedding(self, embedding: list[float] | None, allow_none: bool = True) -> None:

Contributor

It looks like this method could be made static, so let's do that.

Suggested change
Before:
def _validate_embedding(self, embedding: list[float] | None, allow_none: bool = True) -> None:
After:
@staticmethod
def _validate_embedding(embedding: list[float] | None, allow_none: bool = True) -> None:
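
For illustration, here is a minimal sketch of how the static version might look. Only the signature and the `TypeError` message come from the diff; the class name `StoreSketch`, the handling of `None`, and the explicit rejection of `bool` are assumptions, not the PR's actual code.

```python
from __future__ import annotations


class StoreSketch:
    """Hypothetical stand-in for the document store class."""

    @staticmethod
    def _validate_embedding(embedding: list[float] | None, allow_none: bool = True) -> None:
        # Assumed behaviour: None is accepted only when allow_none is True.
        if embedding is None:
            if allow_none:
                return
            msg = "Embedding must not be None"  # hypothetical message
            raise ValueError(msg)
        # bool is a subclass of int, so it is rejected explicitly here (an assumption).
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in embedding):
            msg = "All embedding values must be numeric (int or float)"
            raise TypeError(msg)


# No instance is needed because the method never touches self.
StoreSketch._validate_embedding([0.1, 0.2, 3])
```

Since the method reads nothing from the instance, it can also be unit tested without constructing a store or opening a connection.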

msg = "All embedding values must be numeric (int or float)"
raise TypeError(msg)

def _to_row(self, doc: Document) -> tuple:

Contributor

Let's make this method static.

Suggested change
Before:
def _to_row(self, doc: Document) -> tuple:
After:
@staticmethod
def _to_row(doc: Document) -> tuple:

Contributor

This file is largely redundant and is covered by test_document_store.py. I'd cut this down to just testing the util methods (_parse_embedding, _infer_field_type, _validate_embedding) and drop the rest.

And to follow our test convention please move the unit tests for these util methods to test_document_store.py in their own test class.

Contributor

Please rename this file to test_filters.py to follow our test name convention of one test file per source file that is called the same except with a test_ prefix.

Comment on lines +96 to +97
def test_to_dict(self, document_store):
"""Test serialization to dictionary."""

Contributor

This uses the document_store fixture which is a live connection to the db. Could we create a mock version instead so this becomes a proper unit test?
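
As a rough illustration of the mock-based approach, the sketch below uses `unittest.mock.MagicMock` with `spec=` so the retriever's `isinstance` check passes without a database. `StoreStandIn` and `RetrieverStandIn` are hypothetical stand-ins so the example runs without the integration installed; the real test would use the actual store and retriever classes instead.

```python
from unittest.mock import MagicMock


class StoreStandIn:
    """Hypothetical stand-in for the real document store class."""

    def to_dict(self) -> dict:
        raise NotImplementedError  # never reached in the test below


class RetrieverStandIn:
    """Hypothetical stand-in for the retriever; only to_dict is sketched."""

    def __init__(self, document_store, top_k: int = 10, filter_policy: str = "replace"):
        if not isinstance(document_store, StoreStandIn):
            raise ValueError("document_store must be a StoreStandIn")
        self.document_store = document_store
        self.top_k = top_k
        self.filter_policy = filter_policy

    def to_dict(self) -> dict:
        return {
            "type": "RetrieverStandIn",
            "init_parameters": {
                "document_store": self.document_store.to_dict(),
                "top_k": self.top_k,
                "filter_policy": self.filter_policy,
            },
        }


def test_to_dict():
    # spec= makes the mock pass the isinstance check without opening a connection.
    mock_store = MagicMock(spec=StoreStandIn)
    mock_store.to_dict.return_value = {"type": "StoreStandIn", "init_parameters": {}}

    d = RetrieverStandIn(document_store=mock_store, top_k=5).to_dict()

    assert d["init_parameters"]["filter_policy"] == "replace"
    assert "document_store" in d["init_parameters"]
    mock_store.to_dict.assert_called_once()
```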

assert d["init_parameters"]["filter_policy"] == "replace"
assert "document_store" in d["init_parameters"]

def test_from_dict(self, document_store):

Contributor

assert result == {"documents": expected}
mock_store._embedding_retrieval_async.assert_awaited_once()

def test_from_dict_without_filter_policy(self):

Contributor

Let's drop this test. We don't need to "# Simulate an old serialization that lacks the filter_policy field." since this is a new integration.

Comment on lines +25 to +26
@dataclass
class Db2ConnectionConfig:

Contributor

To be more consistent with our other document store integrations I'd prefer if we could just in-line all of these options in the init method of Db2DocumentStore instead of creating a separate dataclass.

hostname: str
port: int = 50000
username: str = ""
password: str = ""

Contributor

For all sensitive information like the password, we should be using the Secret class from Haystack; otherwise we risk exposing this information, especially during serialization. See here for an example.

As a heads up, this means to_dict and from_dict may need to be updated to handle the Secret serde. See how it's handled in to_dict here and in from_dict here.

Contributor

Please also use Secret on other items you don't think should be exposed in the serialized format.
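
To make the serialization concern concrete, here is a dependency-free sketch of the idea: the store holds a reference to an environment variable, and only that reference is written by `to_dict`. `EnvSecretStandIn`, `StoreSketch`, and the `DB2_PASSWORD` variable name are hypothetical; the integration itself would use Haystack's `Secret` class (from `haystack.utils`) rather than this stand-in.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvSecretStandIn:
    """Hypothetical stand-in for a secret that is read from an environment variable."""

    env_var: str

    def resolve_value(self) -> str:
        # The value is looked up only when it is needed, never stored on the object.
        return os.environ[self.env_var]

    def to_dict(self) -> dict:
        # Only the variable name is serialized, not the secret itself.
        return {"type": "env_var", "env_var": self.env_var}

    @classmethod
    def from_dict(cls, data: dict) -> "EnvSecretStandIn":
        return cls(env_var=data["env_var"])


class StoreSketch:
    """Hypothetical store with connection options inlined in __init__."""

    def __init__(self, hostname: str, password: EnvSecretStandIn, port: int = 50000):
        self.hostname = hostname
        self.port = port
        self.password = password

    def to_dict(self) -> dict:
        return {
            "init_parameters": {
                "hostname": self.hostname,
                "port": self.port,
                "password": self.password.to_dict(),
            }
        }

    @classmethod
    def from_dict(cls, data: dict) -> "StoreSketch":
        params = dict(data["init_parameters"])
        # Rebuild the secret reference before passing it to the constructor.
        params["password"] = EnvSecretStandIn.from_dict(params["password"])
        return cls(**params)
```

With this shape, a serialized pipeline can be shared or committed without leaking credentials, because the password is resolved from the environment at connection time.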

"""
return await asyncio.to_thread(self.count_unique_metadata_by_filter, filters, metadata_fields)

return await asyncio.to_thread(self.get_metadata_fields_info)

Contributor

This is dead code, let's remove it.

# In this case, we'll return empty results or filter them out
error_msg = str(e)
# Check both the error message and the __cause__ attribute
cause_msg = str(e.__cause__) if hasattr(e, "____cause__") and e.__cause__ else ""

Contributor

Too many underscores

Suggested change
Before:
cause_msg = str(e.__cause__) if hasattr(e, "____cause__") and e.__cause__ else ""
After:
cause_msg = str(e.__cause__) if hasattr(e, "__cause__") and e.__cause__ else ""
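
A small sketch of why the typo matters: with the misspelled attribute name, `hasattr` is always false, so `cause_msg` would always be empty. Every exception instance already carries `__cause__` (it is `None` unless set by `raise ... from`), so a plain truthiness check would also work. The helper name `describe` and the error messages are hypothetical.

```python
from __future__ import annotations


def describe(exc: BaseException) -> tuple[str, str]:
    """Return the exception message and the message of its explicit cause, if any."""
    error_msg = str(exc)
    # __cause__ exists on every exception and is None unless `raise ... from` set it.
    cause_msg = str(exc.__cause__) if exc.__cause__ else ""
    return error_msg, cause_msg


try:
    try:
        raise ValueError("bad vector literal")
    except ValueError as inner:
        raise RuntimeError("query failed") from inner
except RuntimeError as e:
    messages = describe(e)
    # The misspelled name from the original line is never present.
    typo_present = hasattr(e, "____cause__")
```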

top_k: Override the constructor top_k for this call.

Returns:
``{"documents": [Document, ...]}``

Contributor

Please use :param: / :returns: type docstrings like you did in filters.py
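
For reference, a sketch of the requested docstring style applied to a retriever-like method. The function name `run_sketch`, the description wording, and the placeholder body are assumptions; only the parameter names and the `top_k` description come from the diff.

```python
from __future__ import annotations

from typing import Any


def run_sketch(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
) -> dict[str, list]:
    """
    Retrieve documents by embedding similarity.

    :param query_embedding: Embedding of the query.
    :param filters: Filters applied to the retrieved documents.
    :param top_k: Override the constructor top_k for this call.
    :returns: A dictionary with the key ``documents`` holding the retrieved documents.
    """
    # Placeholder body so the sketch runs; the real method would query the store.
    return {"documents": []}
```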

Comment on lines +36 to +37
) -> None:
if not isinstance(document_store, Db2DocumentStore):

Contributor

Missing docstrings for all init parameters. Please add them.

Comment on lines +20 to +26
Use inside a Haystack pipeline after a text embedder::

pipeline.add_component("embedder", SentenceTransformersTextEmbedder())
pipeline.add_component("retriever", Db2EmbeddingRetriever(
document_store=store, top_k=5
))
pipeline.connect("embedder.embedding", "retriever.query_embedding")

Contributor

Please wrap python code in code blocks.

filters: dict[str, Any] | None = None,
top_k: int | None = None,
) -> dict[str, list[Document]]:
"""Async variant of :meth:`run`."""

Contributor

Add docstrings for the variables

Contributor

We keep our READMEs in integrations very light. See pgvector's README as an example: https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/pgvector/README.md

Please follow that format. The more in-depth code example will be included elsewhere, in a separate docs contribution to Haystack core and the integration tile in haystack-integrations.

Comment on lines +354 to +355
if not documents:
return 0

Contributor

Let's move this after the if not isinstance(documents, list): check, so that if a user passes in a wrong value like documents="", it will be caught and raise a ValueError instead of just returning 0.
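
The ordering could be sketched like this. The function name `write_documents_sketch`, the error message, and the counting placeholder are hypothetical; the point is only that the type check runs before the empty check.

```python
def write_documents_sketch(documents) -> int:
    """Hypothetical outline of the validation order at the top of write_documents."""
    # Type check first, so a wrong value such as "" is rejected rather than
    # being treated as an empty batch.
    if not isinstance(documents, list):
        msg = "documents must be a list"  # hypothetical message
        raise ValueError(msg)
    if not documents:
        return 0
    # Placeholder for the actual insert; here we only count the items.
    return len(documents)
```

With the checks in the original order, `documents=""` is falsy and would return 0 without ever reaching the type check.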

@sjrl

sjrl commented Jul 2, 2026

Contributor

Hey @priyanshu-krishnan1 thanks for opening the new PR! I've left an initial set of comments.

Also I noticed that the LICENSE.txt file is missing. Please add one. You can copy the one at the top-level of the repo which is here

)


class Db2DocumentStore:

Contributor

To be consistent with our naming conventions for other doc stores, I think it would be great if we could rename this to IBMDb2DocumentStore. WDYT? If so, let's also update the embedding retriever to follow the same convention.

@priyanshu-krishnan1

Author

Hi @sjrl, thanks for providing the initial set of comments. We are currently looking into them and will update the PR with resolutions.

@CLAassistant

CLAassistant commented Jul 2, 2026

CLA assistant check
All committers have signed the CLA.

@sjrl

sjrl commented Jul 2, 2026

Contributor

Hey @priyanshu-krishnan1 and @GeetikaChughIBM thanks for the updates! @GeetikaChughIBM would it be possible for you to sign the CLA agreement as well? #3518 (comment)

Labels

topic:CI, type:documentation
