fix: Upgrade azure-ai-projects and agent-framework #282
Draft
Prachig-Microsoft wants to merge 28 commits into
Conversation
chore: Dev merge to Main
- Update agent-framework from 1.0.0b260107 to 1.3.0 in pyproject.toml
- Update azure-ai-projects from 1.0.0b12 to 2.1.0 in requirements.txt
- Migrate ChatAgent to Agent (client=, default_options=ChatOptions)
- Migrate agent_framework.azure to agent_framework.openai module paths
- Migrate ChatMessage to Message with Content.from_text()
- Migrate Role enum to string literals
- Migrate AgentRunContext to AgentContext
- Migrate WorkflowBuilder to new API (start_executor=, add_chain)
- Migrate event handling from isinstance checks to WorkflowEvent.type
- Migrate GroupChatBuilder to agent_framework.orchestrations module
- Migrate ContextProvider to before_run/after_run interface
- Remove ToolProtocol (use Any), AgentProtocol (use SupportsAgentRun)
- Define ManagerSelectionResponse locally (removed from framework)
- Update MCP tool files for Agent import
- Update all unit tests for new APIs (812 tests passing)
- Update docs/ProcessFrameworkGuide.md with new WorkflowBuilder example
- Update docs/LocalDevelopmentSetup.md prerelease note
- Regenerate uv.lock
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AgentBuilder.with_context_providers() and with_middleware() accepted single objects but passed them directly to Agent(), which expects Sequence types. Now both methods auto-wrap single items into a list. Also wrapped the call site in orchestrator_base.py for clarity. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
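The auto-wrapping described above can be sketched as follows. This is a minimal illustration, not the PR's code: `as_list` is a hypothetical helper name, and the choice to treat `str`/`bytes` as single items is an assumption.

```python
from collections.abc import Sequence
from typing import Any


def as_list(value: Any) -> list[Any]:
    """Normalize a single object or a sequence into a list (hypothetical helper)."""
    if value is None:
        return []
    # str and bytes are Sequences too, but here they count as single items.
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
        return list(value)
    # Anything else is a single item and gets wrapped.
    return [value]
```

A builder method could then store `as_list(providers)` so the constructor that expects a `Sequence` always receives one.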
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The parent OpenAIChatClient._inner_get_response is a regular def that returns ResponseStream (async iterable) when stream=True, or Awaitable when stream=False. The override was async def, which always returned a coroutine, breaking 'async for event in workflow.run(stream=True)'.
Refactored to:
- Regular def _inner_get_response dispatching stream vs non-stream
- _non_streaming_with_retry: async coroutine with retry + context-trim
- _streaming_with_retry: async generator with pre-first-chunk retry
- _maybe_trim_messages: shared context-trim helper
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…erator
The framework's BaseChatClient.get_response checks isinstance(result, ResponseStream) for streaming responses. Our async generator from _streaming_with_retry failed that check, causing the framework to 'await' it, which fails with 'object async_generator can't be used in await expression'.
Fix: for streaming, pass through to the parent's _inner_get_response which returns a proper ResponseStream. Retry is preserved for non-streaming calls. Removed unused _streaming_with_retry method.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
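The dispatch shape these two commits converge on can be illustrated without the framework. In the sketch below every class is a stand-in: `ParentClient`, `FakeStream`, and `RateLimited` are hypothetical names that only mimic the behavior described above (a regular `def` returning either an awaitable or an async-iterable stream), and the retry policy is simplified.

```python
import asyncio


class RateLimited(Exception):
    """Stand-in for a 429 error (hypothetical)."""


class FakeStream:
    """Stand-in for the framework's ResponseStream: an async iterable."""

    def __init__(self, chunks):
        self._chunks = chunks

    def __aiter__(self):
        return self._gen()

    async def _gen(self):
        for chunk in self._chunks:
            yield chunk


class ParentClient:
    """Stand-in parent whose _inner_get_response is a regular def."""

    def __init__(self, failures=0):
        self.failures = failures  # number of simulated 429s before success

    def _inner_get_response(self, *, messages, stream=False):
        if stream:
            return FakeStream(["a", "b"])
        return self._respond(messages)

    async def _respond(self, messages):
        if self.failures > 0:
            self.failures -= 1
            raise RateLimited()
        return f"reply to {len(messages)} message(s)"


class ClientWithRetry(ParentClient):
    def _inner_get_response(self, *, messages, stream=False):
        # Regular def on purpose: an async def would always hand back a coroutine.
        if stream:
            # Pass straight through so the caller gets the parent's stream type.
            return super()._inner_get_response(messages=messages, stream=True)
        return self._non_streaming_with_retry(messages)

    async def _non_streaming_with_retry(self, messages, max_retries=3):
        for attempt in range(max_retries + 1):
            try:
                return await super()._inner_get_response(messages=messages, stream=False)
            except RateLimited:
                if attempt == max_retries:
                    raise
                await asyncio.sleep(0)  # real code would back off here
```

With this shape, `async for` works on the streaming result and `await` works on the non-streaming one, because the override never wraps either in an extra coroutine.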
- _trim_messages: keep at least 1 message (never pop to empty)
- _maybe_trim_messages: fall back to originals if trim produces empty
- _non_streaming_with_retry: re-raise if aggressive trim empties list
- _inner_get_response: log warning and use originals if messages empty
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
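A rough sketch of the "never trim to empty" rule, assuming messages are plain strings and the budget is a character count. Both are simplifications, and the function names only echo the ones in the commit; the bodies are illustrative.

```python
def trim_messages(messages: list[str], max_chars: int) -> list[str]:
    """Drop the oldest messages until the budget fits, keeping at least one."""
    trimmed = list(messages)  # work on a copy; the caller's list is untouched
    while len(trimmed) > 1 and sum(len(m) for m in trimmed) > max_chars:
        trimmed.pop(0)
    return trimmed


def maybe_trim_messages(messages: list[str], max_chars: int) -> list[str]:
    """Fall back to the originals if trimming ever yields an empty list."""
    trimmed = trim_messages(messages, max_chars)
    return trimmed if trimmed else list(messages)
```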
The Responses API requires the new v1 API endpoint. The old preview version (2025-03-01-preview) does not support the /responses endpoint, causing BadRequest 'API version not supported' errors at runtime. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This reverts commit 1d86176.
…pletions API)
Adds a new AzureOpenAIChatClientWithRetry that wraps OpenAIChatCompletionClient (the /chat/completions endpoint) with the same 429-retry and context-trimming logic as the existing AzureOpenAIResponseClientWithRetry, then switches the default client registered in AgentFrameworkHelper and the per-thread client in OrchestratorBase to use it.
The /chat/completions endpoint works with the existing 2025-03-01-preview Azure OpenAI API version, so the v1 API-version bump (commit 1d86176) is no longer required and is reverted in the prior commit.
Mirrors the approach used in microsoft/content-processing-solution-accelerator#599.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eter
OpenAIChatCompletionClient._inner_get_response is a SYNC method that returns
either Awaitable[ChatResponse] (stream=False) or ResponseStream (stream=True),
matching the OpenAIChatClient (Responses API) shape.
The previous implementation used async def without a stream parameter, which
caused the framework's streaming path to receive a coroutine instead of an
AsyncIterable, raising:
'async for' requires an object with __aiter__ method, got coroutine
Mirror the existing AzureOpenAIResponseClientWithRetry pattern: sync _inner_get_response
that branches on stream and delegates non-streaming calls to _non_streaming_with_retry.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OpenAI's Chat Completions endpoint validates the message `name` field against the pattern `^[^\s<|\\/>]+$`. Our agents have display names with whitespace (e.g. `Chief Architect`, `AKS Expert`), which caused a 400 BadRequest after switching the default client to `AzureOpenAIChatClientWithRetry`. Add `_sanitize_author_name` / `_sanitize_author_names` helpers that replace runs of disallowed characters (whitespace, `<`, `|`, `\`, `/`, `>`) with a single underscore and strip leading/trailing underscores. Names that sanitize down to an empty string are dropped entirely so the field can be omitted from the request. The sanitizer is applied inside `AzureOpenAIChatClientWithRetry._inner_get_response` after context trimming (and again after the trim-fallback retry inside `_non_streaming_with_retry`) so the wire format passes validation while in-memory `Message` objects keep their original display names for orchestration logic. Originals are never mutated — modified messages are shallow-copied before the name is rewritten. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
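The sanitizing rule described above fits in a few lines. This is a sketch, not the PR's implementation: it works on plain dicts rather than `Message` objects, and the function names drop the leading underscore to mark them as illustrative.

```python
import re

# Characters rejected by the pattern ^[^\s<|\\/>]+$ quoted above.
_DISALLOWED = re.compile(r"[\s<|\\/>]+")


def sanitize_author_name(name):
    """Replace runs of disallowed characters with "_" and strip edge underscores."""
    if name is None:
        return None
    cleaned = _DISALLOWED.sub("_", name).strip("_")
    return cleaned or None  # None means: omit the field from the request


def sanitize_author_names(messages):
    """Return messages with safe names; originals are never mutated."""
    result = []
    for msg in messages:
        name = msg.get("name")
        clean = sanitize_author_name(name)
        if clean == name:
            result.append(msg)  # nothing to change, reuse the original
            continue
        copy = dict(msg)  # shallow copy before rewriting the name
        if clean is None:
            copy.pop("name", None)
        else:
            copy["name"] = clean
        result.append(copy)
    return result
```

For example, `Chief Architect` becomes `Chief_Architect` in the outgoing payload while the in-memory message keeps its display name.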
The [AOAI_RETRY] empty messages list received warning fired on every turn in group-chat orchestration when the same speaker was selected twice in a row, flooding logs and giving the false impression of an error. This pattern is by design in agent-framework's GroupChatOrchestrator: _broadcast_messages_to_participants excludes the source executor, so when the orchestrator routes back to the same agent, its message cache is empty. The framework already emits its own "AgentExecutor ... Running agent with empty message cache" warning for this case. The actual API call is not empty -- the parent OpenAIChatCompletionClient._prepare_options prepends the agent's system instructions from options["instructions"] before sending. So demoting our duplicate warning to DEBUG removes the noise without hiding any real failure. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The agent_framework_orchestrations.GroupChatBuilder forces the Coordinator's response_format to AgentOrchestrationOutput (strict schema with fields next_speaker/reason/terminate). Our prompt asks for selected_participant/instruction/finish, but strict structured output overrides the prompt's field names. Without aliases, ManagerSelectionResponse.model_validate() silently succeeded with all fields = None (extra=allow), which disabled: - The 3-strike loop-detection streak counter (line 1019-1054) - Coordinator-driven termination on finish=true (line 1065) - _agent_invoked_at[selected] elapsed-time tracking (line 1098) Use Pydantic AliasChoices so the model accepts BOTH naming conventions, restoring anti-loop and termination logic. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Production still hits 400 BadRequest on messages[N].name even though _inner_get_response runs _sanitize_author_names on incoming Messages. The framework's _prepare_options/_prepare_messages_for_openai layer or agent-internal compaction can materialize messages with author_name set AFTER our early sanitization, leaving the dict 'name' field unsanitized on the wire. Override _prepare_messages_for_openai (the parent method that builds the final OpenAI dict payload) to sanitize each dict's 'name' field as a last-mile pass. This is the single chokepoint guaranteed to be on every Chat Completions request, regardless of upstream message-construction path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When Coordinator keeps picking the same agent A and A keeps running, A's own completions were bumping _progress_counter. Loop detection compares the counter snapshot taken at the previous identical Coordinator pick against the current value; if it changed, the streak was reset to 1. So the 3-strike threshold was never reached and the Coordinator->A->A pattern ran until max_rounds. Now we only treat a non-Coordinator completion as 'progress' when the completing agent is different from the agent the Coordinator is currently latching onto (_last_coordinator_selection[0]). A different agent stepping in still resets the streak; A repeating itself does not. Adds two regression tests covering both cases. Also updates an existing termination test whose name described 'other agent makes progress' but actually used the same agent, hard-coding the buggy semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ute by capability
The Coordinator's valid_participants block was a bullet list of names only, so the LLM had no per-agent capability signal. Combined with a Coordinator prompt that names 'Chief Architect' frequently across phases 0/1/4/5/6, the model latched onto Chief Architect repeatedly and the conversation looped on the same agent.
This change populates agent_description on every Analysis participant (Chief Architect, AKS Expert, and the platform experts in platform_registry.json) and renders each description into the Coordinator's valid_participants list. The descriptions are also passed through AgentBuilder.create_agent_by_agentinfo's existing description= argument, so the framework's Agent.description field is no longer always None.
Scope: Analysis step only. design/yaml/documentation orchestrators are left for a follow-up after this change is validated in production.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Production run after the previous progress-counter fix (0da531f) STILL showed Chief Architect picked 6+ consecutive times. Root cause: the loop detection key was (agent, instruction_text). The LLM-driven Coordinator varies its instruction on every pick ('list source blobs', 'read xyz.yaml', 'save analysis_result.md') while latching onto the same agent — so every selection_key was unique, the streak reset to 1 on every pick, and the 3-strike threshold was never reached. Change: track only the agent name (lower-cased). The progress counter (now correct after 0da531f) already encodes 'no DIFFERENT agent ran in between', so 3 consecutive picks of the same agent with no other-agent progress is a strong, low-false-positive loop signal. Adds a regression test that replays the production sequence (same agent, three different instruction strings) and verifies forced termination fires. The earlier tests for exact-match repeats and for B-resets-the- streak continue to pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
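The streak logic described in this commit and the earlier progress-counter fix can be sketched as one small state machine. `LoopDetector` and its method names are hypothetical; only the rules (key on the lower-cased agent name, count progress only when a different agent completes, trip at three) come from the commit messages.

```python
class LoopDetector:
    """Hypothetical sketch of agent-name-keyed loop detection."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.progress_counter = 0
        self._last_agent = None          # agent the Coordinator last picked
        self._progress_at_last_pick = 0  # counter snapshot at that pick
        self._streak = 0

    def record_completion(self, agent: str) -> None:
        # Only an agent other than the latched one counts as progress.
        if self._last_agent is None or agent.strip().lower() != self._last_agent:
            self.progress_counter += 1

    def record_selection(self, agent: str) -> bool:
        """Register a Coordinator pick; return True when termination should fire."""
        key = agent.strip().lower()  # instruction text is deliberately ignored
        if key == self._last_agent and self.progress_counter == self._progress_at_last_pick:
            self._streak += 1
        else:
            self._streak = 1
        self._last_agent = key
        self._progress_at_last_pick = self.progress_counter
        return self._streak >= self.threshold
```

Because the key ignores the instruction, three picks of the same agent with different instructions still trip the guard, which is the production sequence the regression test replays.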
Production deployment of the agent-framework 1.3.0 upgrade surfaced a crash chain: Analysis "succeeded" with a self-contradictory result (result=True, is_hard_terminated=False, output=None), Design then crashed at `task_param.output.process_id`. The root cause is the ResultGenerator returning an empty shell when participants never produced useful content.
Fixes:
* groupchat_orchestrator.run_stream now validates ResultGenerator output before constructing OrchestrationResult. If the result is not hard terminated but carries no `output` / `termination_output` payload, the orchestrator now reports success=False with a descriptive error. This is generic across all four step models (Analysis uses `output`; Design/Convert/Documentation use `termination_output`).
* All four step executors gained a defense-in-depth guard that raises a clear `<Step>Executor failed: produced no <X>Output. Reason: ...` exception when the same incoherent shape is observed. This stops the broken value at the boundary instead of propagating it downstream.
* groupchat_orchestrator silent `except Exception: pass` around Coordinator JSON parsing replaced with `logger.debug(... exc_info=...)` so loop-detection failures become visible during debugging instead of being swallowed.
Tests:
* Updated each executor's existing soft-completion test to provide a valid output (previous setup encoded the broken shape we now reject).
* Added a new guard test per executor asserting the new exception fires for the incoherent (success=True + output=None + not hard-terminated) shape.
* Full unit suite: 829 passed (was 825; +4 new guard tests).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
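The boundary guard might look roughly like the sketch below. `StepResult` and `ensure_coherent` are hypothetical stand-ins for the real step models and executor checks; only the rejected shape and the error wording follow the commit message.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StepResult:
    """Hypothetical stand-in for a step's result model."""
    result: bool
    is_hard_terminated: bool
    output: Optional[Any] = None
    termination_output: Optional[Any] = None
    reason: str = ""


def ensure_coherent(step_name: str, res: StepResult) -> StepResult:
    """Reject 'success' results that carry no payload and were not hard terminated."""
    has_payload = res.output is not None or res.termination_output is not None
    if res.result and not res.is_hard_terminated and not has_payload:
        raise RuntimeError(
            f"{step_name}Executor failed: produced no {step_name}Output. "
            f"Reason: {res.reason or 'unknown'}"
        )
    return res
```

Raising at the step boundary keeps the next step from dereferencing a missing output, which is the crash the commit describes.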
Root cause of the "Runner did not converge after 100 iterations"
production failure (and the Chief-Architect-only loop that preceded it):
agent-framework 1.3.0 changed how AgentResponseUpdate is constructed.
`map_chat_to_agent_update` (_types.py:2825-2837) now only sets
`author_name` and leaves `agent_id` as None.
Our orchestrator was reading `event.agent_id` exclusively, so every
streaming update resolved to `agent_name=""`. That silently broke:
* Loop detection (line 1080 `if agent_name == self.coordinator_name`
never matched, so the streak counter never advanced and the 3x
same-agent guard never fired). Production looped 100x on Chief
Architect with zero detection.
* Coordinator termination signal extraction (`finish=true`,
`instruction=complete`, blocking instructions) - same gated block.
* Manager-instruction parsing for the next participant.
The [MEMORY] logs continued to show real agent names ("Chief Architect")
because `SharedMemoryContextProvider` reads the name from the agent's
own context, not from the workflow event - which is why the regression
was invisible from logs alone.
Fix: in `_handle_agent_update`, prefer `event.author_name` (which IS
populated by 1.3.0's `map_chat_to_agent_update`) and fall back to
`agent_id` only when author_name is missing, for backwards compat with
older event shapes. Use `getattr` defensively so existing tests that
construct SimpleNamespace events without author_name still work.
Tests:
* test_handle_agent_update_resolves_coordinator_via_author_name_when_agent_id_is_none
- asserts the identity resolution itself
* test_loop_detection_fires_on_3_consecutive_coordinator_selections_via_handle_agent_update
- end-to-end through the production code path: 3 identical Coordinator
selections via _handle_agent_update must trip _forced_termination
* Both tests verified to FAIL without the fix (intentionally reverted to
confirm) and PASS with the fix
* Full suite: 831 passed (was 829, +2 regression tests)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rounds for af 1.3.0
In agent-framework 1.3.0, `workflow.run(stream=True)` only yields
`WorkflowEvent` instances. `AgentResponseUpdate` is wrapped inside
`event.data` for `type=="output"` events. The two types are unrelated
(verified by MRO), so the previous `isinstance(event, AgentResponseUpdate)`
gate from the b260107 era was permanently dead in 1.3.0. As a result every
orchestrator-side safety guard inside that branch silently no-opped:
* per-agent loop detection
* Coordinator finish=true detection
* max_rounds enforcement
* streaming callback dispatch
* manager-instruction extraction
That is why production runs hit the framework's own 100-iteration runner
cap as `RuntimeError("Runner did not converge after 100 iterations")`
even after the recent identity-resolution patch (which only touched code
that never executed).
Three coordinated fixes:
1. Replace the dead `isinstance(event, AgentResponseUpdate)` gate with
`isinstance(event, WorkflowEvent) and event.type == "output"` and
inspect `event.data` / `event.executor_id` to distinguish per-
participant streaming chunks (executor_id matches one of self.agents
and data is AgentResponseUpdate) from the framework orchestrator's
final output (list[Message] or custom result object).
2. Add `executor_id` parameter to `_handle_agent_update` so identity
resolves from the WorkflowEvent wrapper's executor_id (always populated
from `AgentExecutor.id` = the agent's name) first, then falls back
to `event.author_name`, then legacy `event.agent_id`. Matches the
approach already used by Content Processing Solution.
3. Pass `max_rounds=self.max_rounds` and `intermediate_outputs=True`
to `GroupChatBuilder`:
- `max_rounds` gives the framework itself a clean termination
ceiling so even if our orchestrator-side guards miss, the workflow
halts cleanly instead of crashing at the runner's 100-iteration cap.
- `intermediate_outputs=True` is required for each participant's
`yield_output(AgentResponseUpdate)` call to surface as a workflow
`output` event. Without this, only the orchestrator's final yield
reaches our streaming loop and the per-agent guards above never run.
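The three-tier identity resolution from fix 2 can be sketched with plain objects. `resolve_agent_name` is a hypothetical name, and stripping everything up to the first colon is an assumption about how a prefix such as `groupchat_agent:` is removed.

```python
from types import SimpleNamespace


def resolve_agent_name(update, executor_id=None) -> str:
    """Resolve the speaking agent: executor_id, then author_name, then agent_id."""
    if executor_id:
        # Assumed prefix handling: "groupchat_agent:Coordinator" -> "Coordinator".
        return executor_id.split(":", 1)[-1]
    # getattr keeps older event shapes (and SimpleNamespace test doubles) working.
    return getattr(update, "author_name", None) or getattr(update, "agent_id", None) or ""
```

Usage: `resolve_agent_name(SimpleNamespace(author_name="AKS Expert"), "Chief Architect")` prefers the executor id, matching the precedence the new test locks in.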
Tests:
* Existing termination/loop-detection tests still pass (handler now has
3-tier identity resolution with backward-compat for `author_name`).
* Added `test_handle_agent_update_prefers_executor_id_over_author_name`
to lock in the new precedence.
* Added `test_handle_agent_update_strips_executor_id_prefix` to cover
the `groupchat_agent:Coordinator` framework prefix.
* Full suite: 833 passed (was 831; +2 new tests).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In agent-framework 1.3.0 the GroupChat orchestrator agent (Coordinator) is invoked directly inside the framework's internal _invoke_agent_helper (agent_framework_orchestrations/_group_chat.py:484) rather than through an AgentExecutor. The Coordinator therefore never surfaces as a workflow event, which makes our existing Coordinator-JSON-based loop detector in _complete_agent_response permanently dead in 1.3.0.
Symptom in production: workflow loops with the Coordinator latched onto the same participant (e.g., Chief Architect repeatedly asked to produce an Evidence Pack that never satisfies the next reviewer). The loop runs until the framework's max_rounds ceiling fires (~17 min at default 100) instead of being caught early.
Fix:
* Track participant turn completions from WorkflowEvent.executor_completed, the one observable signal that does NOT depend on Coordinator visibility (participants ARE wrapped in AgentExecutor and so do emit these events).
* Force-terminate (hard_loop) after 3 consecutive completions of the same participant.
* Force-terminate (hard_timeout) when total participant completions reach max_rounds; independent of len(agent_responses) which only grows on agent switch and so can never reach max_rounds during a same-participant loop.
* Flush per-participant streaming buffer on each executor_completed so back-to-back same-agent turns produce one AgentResponse per turn instead of accumulating across turns.
* Move forced-termination break check to top of the streaming loop so any branch (timeout, participant loop, Coordinator finish=true) takes effect on the very next event rather than waiting for the next output event.
Adds 3 regression tests covering the streak trigger, the alternation reset, and the round-budget enforcement. 836 tests pass (833 -> 836).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ed events"
This reverts commit 7a0212f.
…p dead code
* Add a narrow logging.Filter on agent_framework._workflows._agent_executor that drops only the 'Running agent with empty message cache' message. This warning fires by design in GroupChat orchestration when the orchestrator routes back to the same speaker (broadcast cache is empty because _broadcast_messages_to_participants excludes the source executor). The framework's parent client prepends system instructions before the LLM call, so the API request still has content. Other warnings/errors from the same logger remain visible.
* Remove three lines of commented-out duplicate callback invocation in groupchat_orchestrator._complete_agent_response. The live callback handler is in the block directly above; the commented block was refactor debris.
No behavioural change. All 833 tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
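A narrow filter of the kind described above can be built with the standard `logging` module alone. The class name `DropEmptyCacheWarning` is hypothetical; the logger name and the message fragment are the ones quoted in the commit.

```python
import logging

_NOISY_FRAGMENT = "Running agent with empty message cache"


class DropEmptyCacheWarning(logging.Filter):
    """Drop only records containing the known by-design warning text."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record; everything else passes through.
        return _NOISY_FRAGMENT not in record.getMessage()


# Attach to the specific logger so other loggers are unaffected.
logging.getLogger("agent_framework._workflows._agent_executor").addFilter(
    DropEmptyCacheWarning()
)
```

Filtering on the message text rather than raising the logger's level keeps other warnings and errors from the same module visible.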
This pull request makes significant updates to the agent framework integration and developer experience, focusing on modernizing the agent builder interface, updating dependencies, and improving documentation for clarity and correctness. The most important changes are grouped below:

Agent Framework Refactor and Modernization:
- Updated `agent_builder.py` to use the new `Agent` class instead of `ChatAgent`, updated type hints to support new interfaces (e.g., `SupportsChatGetResponse`, `HistoryProvider`), and reworked option handling to use a `ChatOptions` object. Added logic to strip unsupported parameters for certain reasoning models and improved middleware/context provider handling for greater flexibility.

Dependency Updates:
- Updated `agent-framework` to version `1.3.0` in `src/processor/pyproject.toml` and `azure-ai-projects` to `2.1.0` in `infra/vscode_web/requirements.txt` to ensure compatibility with the latest features and bug fixes.

SDK Usage Modernization:
- Updated `infra/vscode_web/codeSample.py` to use the latest `azure-ai-projects` API patterns, including the new import for `ListSortOrder`, revised thread/message/run creation methods, and improved error handling and message printing logic.

Documentation Improvements:
- Updated the `--prerelease=allow` flag in dependency installation instructions and provided more context about transitive pre-release dependencies in `docs/LocalDevelopmentSetup.md`.
- Updated `docs/ProcessFrameworkGuide.md` to reflect the new `WorkflowBuilder` API, and simplified the unit test command to remove unnecessary flags.

These changes collectively modernize the codebase, improve developer onboarding, and ensure compatibility with the latest SDK and agent framework features.

## Purpose
Does this introduce a breaking change?
Golden Path Validation
Deployment Validation
What to Check
Verify that the following are valid
Other Information