feat: Added IBM Db2 vector store integration#3518
Conversation
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
|
Hey @bogdankostic I was already reviewing this in #3458 (comment) so I'll take this over |
| - modules: | ||
| - haystack_integrations.document_stores.ibm_db.document_store |
There was a problem hiding this comment.
Missing the embedding retriever
| - modules: | |
| - haystack_integrations.document_stores.ibm_db.document_store | |
| - modules: | |
| - haystack_integrations.components.retrievers.ibm_db.embedding_retriever | |
| - haystack_integrations.document_stores.ibm_db.document_store |
| all = 'pytest {args:tests}' | ||
| unit-cov-retry = 'pytest --cov=haystack_integrations --reruns 3 --reruns-delay 30 -x -m "not integration" {args:tests}' | ||
| integration-cov-append-retry = 'pytest --cov=haystack_integrations --cov-append --reruns 3 --reruns-delay 30 -x -m "integration" {args:tests}' | ||
| types = "mypy -p haystack_integrations.document_stores.ibm_db {args}" |
There was a problem hiding this comment.
missing type checking on the retriever
| types = "mypy -p haystack_integrations.document_stores.ibm_db {args}" | |
| types = "mypy -p haystack_integrations.document_stores.ibm_db -p haystack_integrations.components.retrievers.ibm_db {args}" |
| # If it still fails, raise the error | ||
| raise | ||
|
|
||
| def _validate_embedding(self, embedding: list[float] | None, allow_none: bool = True) -> None: |
There was a problem hiding this comment.
It looks like this method could be made static so lets do that
| def _validate_embedding(self, embedding: list[float] | None, allow_none: bool = True) -> None: | |
| @staticmethod | |
| def _validate_embedding(embedding: list[float] | None, allow_none: bool = True) -> None: |
| msg = "All embedding values must be numeric (int or float)" | ||
| raise TypeError(msg) | ||
|
|
||
| def _to_row(self, doc: Document) -> tuple: |
There was a problem hiding this comment.
Lets make this method static
| def _to_row(self, doc: Document) -> tuple: | |
| @staticmethod | |
| def _to_row(doc: Document) -> tuple: |
There was a problem hiding this comment.
This file is largely redundant and is covered by test_document_store.py I'd cut this down to just testing the util methods: _parse_embedding, _infer_field_type, _validate_embedding and drop the rest.
And to follow our test convention please move the unit tests for these util methods to test_document_store.py in their own test class.
There was a problem hiding this comment.
Please rename this file to test_filters.py to follow our test name convention of one test file per source file that is called the same except with a test_ prefix.
| def test_to_dict(self, document_store): | ||
| """Test serialization to dictionary.""" |
There was a problem hiding this comment.
This uses the document_store fixture which is a live connection to the db. Could we create a mock version instead so this becomes a proper unit test?
| assert d["init_parameters"]["filter_policy"] == "replace" | ||
| assert "document_store" in d["init_parameters"] | ||
|
|
||
| def test_from_dict(self, document_store): |
There was a problem hiding this comment.
| assert result == {"documents": expected} | ||
| mock_store._embedding_retrieval_async.assert_awaited_once() | ||
|
|
||
| def test_from_dict_without_filter_policy(self): |
There was a problem hiding this comment.
Lets drop this test we don't need to "# Simulate an old serialization that lacks the filter_policy field." since this is a new integration.
| @dataclass | ||
| class Db2ConnectionConfig: |
There was a problem hiding this comment.
To be more consistent with our other document store integrations I'd prefer if we could just in-line all of these options in the init method of Db2DocumentStore instead of creating a separate dataclass.
| hostname: str | ||
| port: int = 50000 | ||
| username: str = "" | ||
| password: str = "" |
There was a problem hiding this comment.
For all sensitive information like password we should be using the Secret class from Haystack otherwise we risk exposing this information especially when running serialization. See here for an example
As a heads up this means the to_dict and from_dict may need to be updated to handle the Secret serde. See how its handle in to_dict here and from_dict here
There was a problem hiding this comment.
Please also use Secret on other items you don't think should be exposed in the serialized format.
| """ | ||
| return await asyncio.to_thread(self.count_unique_metadata_by_filter, filters, metadata_fields) | ||
|
|
||
| return await asyncio.to_thread(self.get_metadata_fields_info) |
There was a problem hiding this comment.
This is dead code, lets remove it
| # In this case, we'll return empty results or filter them out | ||
| error_msg = str(e) | ||
| # Check both the error message and the __cause__ attribute | ||
| cause_msg = str(e.__cause__) if hasattr(e, "____cause__") and e.__cause__ else "" |
There was a problem hiding this comment.
Too many underscores
| cause_msg = str(e.__cause__) if hasattr(e, "____cause__") and e.__cause__ else "" | |
| cause_msg = str(e.__cause__) if hasattr(e, "__cause__") and e.__cause__ else "" |
| top_k: Override the constructor top_k for this call. | ||
|
|
||
| Returns: | ||
| ``{"documents": [Document, ...]}`` |
There was a problem hiding this comment.
Please use :param: / :returns: type docstrings like you did in filters.py
| ) -> None: | ||
| if not isinstance(document_store, Db2DocumentStore): |
There was a problem hiding this comment.
Missing docstrings for all init parameters. Please add them.
| Use inside a Haystack pipeline after a text embedder:: | ||
|
|
||
| pipeline.add_component("embedder", SentenceTransformersTextEmbedder()) | ||
| pipeline.add_component("retriever", Db2EmbeddingRetriever( | ||
| document_store=store, top_k=5 | ||
| )) | ||
| pipeline.connect("embedder.embedding", "retriever.query_embedding") |
There was a problem hiding this comment.
Please wrap python code in code blocks.
| filters: dict[str, Any] | None = None, | ||
| top_k: int | None = None, | ||
| ) -> dict[str, list[Document]]: | ||
| """Async variant of :meth:`run`.""" |
There was a problem hiding this comment.
Add docstrings for the variables
There was a problem hiding this comment.
We keep our readmes in integrations very light. See PGVectors as an example https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/pgvector/README.md
Please follow that format. The more in depth code example will be included elsewhere in a separate docs contribution to Haystack core and the integration tile in haystack-integrations
| if not documents: | ||
| return 0 |
There was a problem hiding this comment.
Lets move this after the if not isinstance(documents, list): check so if a user passes in a wrong value like documents="" then it will be caught and raise a ValueError instead of just returning 0
|
Hey @priyanshu-krishnan1 thanks for opening the new PR! I've left an initial set of comments. Also I noticed that the |
| ) | ||
|
|
||
|
|
||
| class Db2DocumentStore: |
There was a problem hiding this comment.
To be consistent with our naming conventions for other doc stores I think it would be great if we could rename this to IBMDb2DocumentStore. WDYT? If so lets also update the embedding retriever to follow the same convention.
|
Hi @sjrl Thanks for providing initial set of comment, currently looking into it. |
…ove obsolete test_from_dict_without_filter_policy test
|
Hey @priyanshu-krishnan1 and @GeetikaChughIBM thanks for the updates! @GeetikaChughIBM would it be possible for you to sign the CLA agreement as well? #3518 (comment) |
Related Issues
Proposed Changes:
Added a new integration for IBM Db2 database with vector search capabilities:
How did you test it?
Notes for the reviewer
Checklist
feat: Add IBM Db2 vector store integration