fix: close prior httpx client before reload to stop FD leak in long fitd runs#1787
fix: close prior httpx client before reload to stop FD leak in long fitd runs#1787mvanhorn wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
While the description poses a plausible explanation of how this might resolve the issue, it misses that _load_unsafe() is only called on instantiation or deserialization when _close_client() would never have an object to operate on as self.client would not be set during normal runtime even when the parallel_attempts option has been enabled.
When _load_unsafe() is called during deserialization of the generator object to another in a new process again self.client is specifically omitted as it is not pickle safe. Meaning the close here would only occur during garbage collection of the spawned processes.
This suggests to me that this PR could be reduced to simply
def __del__(self):
if self.client:
self.client.close()Though this may need some of the exception handling noted in the current _close_client() offering.
There may be some more nuance to this and a proper reproduction test is needed to evaluate the actual required scope of changes.
…itd runs Fixes NVIDIA#1466 Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
f34b8d5 to
ed1e98f
Compare
|
Thanks for the close read. A couple of points where I think the PR already addresses what you're suggesting, and where the actual leak path sits: On def __del__(self):
self._close_client()and On Where the leak actually fires in long fitd runs: the path I was targeting is if self.client is None:
self._close_client() # <-- this is the new line
self._load_unsafe()During fitd we hit this reload path repeatedly, and The two tests in Happy to:
Let me know which direction you prefer. |
jmartin-tech
left a comment
There was a problem hiding this comment.
The location you point to as the root cause seems suspect. self.generator holds a reference to a method on the self.client object and no code ever mutates that reference once set, as far as I can tell. If during _call_model() the self.client object is ever None or unset then self.generator should also be None and there should not be any path to access self.generator._client. The _unsafe_attributes of the class are intentionally not modified during serialization so I am unsure how the described state is being reached. It is possible there is some missed side-effect or something internal to how the OpenAI object holds state in a long running process that may offer that missing piece on context.
Note the stack trace in the issue shows the error having been triggered in a secondary generator used by the probe only on the core thread, which in theory should never pass thru serialization meaning the client object should not be changing or getting cleared. If it is changing then chasing down why may help with understanding the true root cause. If that value is changing then some deeper issue may exist.
For these reasons I am skeptical this solves the issue vs just hiding it better and asked for some clear reproduction test. A unit test if consistent would be great, a quickly reproducible manual test documented as a comment here would also be acceptable.
The additions in _load_unsafe() are not required and the fact they had to be copied into all overrides of _load_unsafe() shows they add a responsibility to the method that is not documented, the goal of the method is to load object that were unsafe to serialize.
Summary
In
garak/generators/openai.py, modify_call_model's reload branch to callself.client.close()inside atry/exceptbefore reassigning, and add a_close_client()helper that null-checksself.clientand swallows shutdown errors. Ingarak/generators/nim.py, wire_load_unsafeto callself._close_client()before instantiating the newopenai.OpenAI(...). Add a__del__(orclose()) onOpenAICompatiblethat also closes the client so garbage-collected generators release their httpx pools. Keep the change strictly scoped to client teardown to avoid altering retry/backoff semantics covered by existing tests.Why this matters
After roughly 12 hours of an uncapped
fitdprobe run against NIM endpoints, garak crashes withOSError: [Errno 24] Too many open files. The traceback (added by maintainer jmartin-tech) traces back throughgarak/generators/openai.py:_call_model->nim.py:_load_client->openai.OpenAI(...)-> httpxcreate_ssl_context, indicating each reload of the OpenAI client opens a new SSL context and httpx pool without closing the prior one.OpenAICompatible._call_modelreloads the client wheneverself.client is None, and nothing in the reload path closes the previous transport.Testing
OpenAICompatiblegenerator, settingself.client = None, then calling_call_modeldoes not leak the previous client (assertprior_client.closewas called).self.clientis alreadyNone(first call) the new_close_client()is a no-op and the reload still produces a usable client.prior_client.close()raises (e.g. transport already torn down) the reload still succeeds and the failure is logged at debug level rather than propagated.Fixes #1466
AI was used for assistance.