Skip to content

fix: close prior httpx client before reload to stop FD leak in long fitd runs#1787

Open
mvanhorn wants to merge 1 commit into
NVIDIA:mainfrom
mvanhorn:fix/1466-nim-openai-client-fd-leak
Open

fix: close prior httpx client before reload to stop FD leak in long fitd runs#1787
mvanhorn wants to merge 1 commit into
NVIDIA:mainfrom
mvanhorn:fix/1466-nim-openai-client-fd-leak

Conversation

@mvanhorn
Copy link
Copy Markdown

Summary

In garak/generators/openai.py, modify _call_model's reload branch to call self.client.close() inside a try/except before reassigning, and add a _close_client() helper that null-checks self.client and swallows shutdown errors. In garak/generators/nim.py, wire _load_unsafe to call self._close_client() before instantiating the new openai.OpenAI(...). Add a __del__ (or close()) on OpenAICompatible that also closes the client so garbage-collected generators release their httpx pools. Keep the change strictly scoped to client teardown to avoid altering retry/backoff semantics covered by existing tests.

Why this matters

After roughly 12 hours of an uncapped fitd probe run against NIM endpoints, garak crashes with OSError: [Errno 24] Too many open files. The traceback (added by maintainer jmartin-tech) traces back through garak/generators/openai.py:_call_model -> nim.py:_load_client -> openai.OpenAI(...) -> httpx create_ssl_context, indicating each reload of the OpenAI client opens a new SSL context and httpx pool without closing the prior one. OpenAICompatible._call_model reloads the client whenever self.client is None, and nothing in the reload path closes the previous transport.

Testing

  • Happy path: instantiating an OpenAICompatible generator, setting self.client = None, then calling _call_model does not leak the previous client (assert prior_client.close was called).
  • Edge case: when self.client is already None (first call) the new _close_client() is a no-op and the reload still produces a usable client.
  • Error path: when prior_client.close() raises (e.g. transport already torn down) the reload still succeeds and the failure is logged at debug level rather than propagated.

Fixes #1466

AI was used for assistance.

Copy link
Copy Markdown
Collaborator

@jmartin-tech jmartin-tech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the description poses a plausible explanation of how this might resolve the issue, it misses that _load_unsafe() is only called on instantiation or deserialization when _close_client() would never have an object to operate on as self.client would not be set during normal runtime even when the parallel_attempts option has been enabled.

When _load_unsafe() is called during deserialization of the generator object to another in a new process again self.client is specifically omitted as it is not pickle safe. Meaning the close here would only occur during garbage collection of the spawned processes.

This suggests to me that this PR could be reduced to simply

def __del__(self):
    if self.client:
        self.client.close()

Though this may need some of the exception handling noted in the current _close_client() offering.

There may be some more nuance to this and a proper reproduction test is needed to evaluate the actual required scope of changes.

…itd runs

Fixes NVIDIA#1466

Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
@mvanhorn mvanhorn force-pushed the fix/1466-nim-openai-client-fd-leak branch from f34b8d5 to ed1e98f Compare May 20, 2026 03:39
@mvanhorn
Copy link
Copy Markdown
Author

Thanks for the close read. A couple of points where I think the PR already addresses what you're suggesting, and where the actual leak path sits:

On __del__: The PR already includes exactly the shape you propose, in openai.py:

def __del__(self):
    self._close_client()

and _close_client() already wraps the close in the exception handling that _close_client would need (AttributeError, OSError, RuntimeError, ValueError), with a debug log on failure. So the deserialize-and-discard path you're describing is covered.

On _load_unsafe() adding _close_client(): You're right that during a fresh instantiation or unpickle, self.client isn't set yet, so the _close_client() call is a no-op there. I left those in as defensive belt-and-suspenders, but I'm happy to drop them if you'd prefer the smaller diff -- they don't change behavior in the call paths you cited.

Where the leak actually fires in long fitd runs: the path I was targeting is _call_model in openai.py:

if self.client is None:
    self._close_client()   # <-- this is the new line
    self._load_unsafe()

During fitd we hit this reload path repeatedly, and self.client is None doesn't imply generator._client is None -- after a prior _load_unsafe() we have self.generator populated with an openai.OpenAI whose internal _client (the httpx client) is still open. Without closing it before assigning a new one, each reload leaks the prior httpx client's connection pool, which is what shows up as the FD exhaustion in #1466.

The two tests in test_openai_compatible.py cover both branches: test_openai_reload_closes_prior_client for the path with a prior generator._client, and test_openai_reload_without_prior_client for the cold path.

Happy to:

  1. Drop the defensive _close_client() calls inside _load_unsafe() if you want the diff minimal, OR
  2. Add a more explicit fitd-style reproduction that asserts FD count stays bounded over N reloads -- I have one locally, just wasn't sure if you want it in tests/ proper or as a manual repro in the issue.

Let me know which direction you prefer.

Copy link
Copy Markdown
Collaborator

@jmartin-tech jmartin-tech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The location you point to as the root cause seems suspect. self.generator holds a reference to a method on the self.client object and no code ever mutates that reference once set, as far as I can tell. If during _call_model() the self.client object is ever None or unset then self.generator should also be None and there should not be any path to access self.generator._client. The _unsafe_attributes of the class are intentionally not modified during serialization so I am unsure how the described state is being reached. It is possible there is some missed side-effect or something internal to how the OpenAI object holds state in a long running process that may offer that missing piece on context.

Note the stack trace in the issue shows the error having been triggered in a secondary generator used by the probe only on the core thread, which in theory should never pass thru serialization meaning the client object should not be changing or getting cleared. If it is changing then chasing down why may help with understanding the true root cause. If that value is changing then some deeper issue may exist.

For these reasons I am skeptical this solves the issue vs just hiding it better and asked for some clear reproduction test. A unit test if consistent would be great, a quickly reproducible manual test documented as a comment here would also be acceptable.

The additions in _load_unsafe() are not required and the fact they had to be copied into all overrides of _load_unsafe() shows they add a responsibility to the method that is not documented, the goal of the method is to load object that were unsafe to serialize.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: FD leak exposed by fitd

2 participants