fix: Upgrade azure-ai-projects and agent-framework by Prachig-Microsoft · Pull Request #282 · microsoft/Container-Migration-Solution-Accelerator

Prachig-Microsoft · 2026-06-13T11:55:04Z

This pull request makes significant updates to the agent framework integration and developer experience, focusing on modernizing the agent builder interface, updating dependencies, and improving documentation for clarity and correctness. The most important changes are grouped below:

Agent Framework Refactor and Modernization:

Refactored agent_builder.py to use the new Agent class instead of ChatAgent, updated type hints to support new interfaces (e.g., SupportsChatGetResponse, HistoryProvider), and reworked option handling to use a ChatOptions object. Added logic to strip unsupported parameters for certain reasoning models and improved middleware/context provider handling for greater flexibility. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14]

Dependency Updates:

Upgraded agent-framework to version 1.3.0 in src/processor/pyproject.toml and azure-ai-projects to 2.1.0 in infra/vscode_web/requirements.txt to ensure compatibility with the latest features and bug fixes. [1] [2]

SDK Usage Modernization:

Updated infra/vscode_web/codeSample.py to use the latest azure-ai-projects API patterns, including the new import for ListSortOrder, revised thread/message/run creation methods, and improved error handling and message printing logic. [1] [2]

Documentation Improvements:

Clarified the need for the --prerelease=allow flag in dependency installation instructions and provided more context about transitive pre-release dependencies in docs/LocalDevelopmentSetup.md.
Updated workflow construction instructions in docs/ProcessFrameworkGuide.md to reflect the new WorkflowBuilder API, and simplified the unit test command to remove unnecessary flags. [1] [2]

These changes collectively modernize the codebase, improve developer onboarding, and ensure compatibility with the latest SDK and agent framework features.

## Purpose

...

Does this introduce a breaking change?

Yes
No

Golden Path Validation

I have tested the primary workflows (the "golden path") to ensure they function correctly without errors.

Deployment Validation

I have validated the deployment process successfully and all services are running as expected with this change.

What to Check

Verify that the following are valid

...

Other Information

chore: Dev merge to Main

- Update agent-framework from 1.0.0b260107 to 1.3.0 in pyproject.toml
- Update azure-ai-projects from 1.0.0b12 to 2.1.0 in requirements.txt
- Migrate ChatAgent to Agent (client=, default_options=ChatOptions)
- Migrate agent_framework.azure to agent_framework.openai module paths
- Migrate ChatMessage to Message with Content.from_text()
- Migrate Role enum to string literals
- Migrate AgentRunContext to AgentContext
- Migrate WorkflowBuilder to new API (start_executor=, add_chain)
- Migrate event handling from isinstance checks to WorkflowEvent.type
- Migrate GroupChatBuilder to agent_framework.orchestrations module
- Migrate ContextProvider to before_run/after_run interface
- Remove ToolProtocol (use Any), AgentProtocol (use SupportsAgentRun)
- Define ManagerSelectionResponse locally (removed from framework)
- Update MCP tool files for Agent import
- Update all unit tests for new APIs (812 tests passing)
- Update docs/ProcessFrameworkGuide.md with new WorkflowBuilder example
- Update docs/LocalDevelopmentSetup.md prerelease note
- Regenerate uv.lock

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

AgentBuilder.with_context_providers() and with_middleware() accepted single objects but passed them directly to Agent(), which expects Sequence types. Now both methods auto-wrap single items into a list. Also wrapped the call site in orchestrator_base.py for clarity. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
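The auto-wrapping described above can be sketched as follows. This is a minimal illustration, not the code from agent_builder.py: `as_list` and `DemoBuilder` are hypothetical names, and treating `str`/`bytes` as single items is an assumption.

```python
from collections.abc import Sequence
from typing import Any


def as_list(value: Any) -> list[Any]:
    """Return value as a list, wrapping a single object when needed."""
    if value is None:
        return []
    # str and bytes are Sequences too, but here they count as single items.
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
        return list(value)
    return [value]


class DemoBuilder:
    """Hypothetical stand-in for AgentBuilder; only the wrapping is shown."""

    def __init__(self) -> None:
        self.middleware: list[Any] = []

    def with_middleware(self, middleware: Any) -> "DemoBuilder":
        # Accept one object or a sequence; always store a list.
        self.middleware = as_list(middleware)
        return self
```

With this shape, `DemoBuilder().with_middleware(single_item)` and `DemoBuilder().with_middleware([a, b])` both leave a list behind, so a constructor that expects a Sequence receives one either way.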

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The parent OpenAIChatClient._inner_get_response is a regular def that returns ResponseStream (async iterable) when stream=True, or Awaitable when stream=False. The override was async def, which always returned a coroutine, breaking 'async for event in workflow.run(stream=True)'.

Refactored to:
- Regular def _inner_get_response dispatching stream vs non-stream
- _non_streaming_with_retry: async coroutine with retry + context-trim
- _streaming_with_retry: async generator with pre-first-chunk retry
- _maybe_trim_messages: shared context-trim helper

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…erator

The framework's BaseChatClient.get_response checks isinstance(result, ResponseStream) for streaming responses. Our async generator from _streaming_with_retry failed that check, causing the framework to 'await' it, which fails with 'object async_generator can't be used in await expression'.

Fix: for streaming, pass through to the parent's _inner_get_response which returns a proper ResponseStream. Retry is preserved for non-streaming calls. Removed unused _streaming_with_retry method.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- _trim_messages: keep at least 1 message (never pop to empty)
- _maybe_trim_messages: fall back to originals if trim produces empty
- _non_streaming_with_retry: re-raise if aggressive trim empties list
- _inner_get_response: log warning and use originals if messages empty

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
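The "never trim to empty" guards listed above could look roughly like this. It is a sketch under stated assumptions: messages are plain strings, the budget is a character count (`max_chars`), and the oldest message is dropped first. The real helpers operate on framework Message objects and may use a different budget.

```python
def trim_messages(messages: list[str], max_chars: int) -> list[str]:
    """Drop the oldest messages until the budget fits, keeping at least one."""
    trimmed = list(messages)  # work on a copy; the caller's list is untouched
    while len(trimmed) > 1 and sum(len(m) for m in trimmed) > max_chars:
        trimmed.pop(0)  # oldest first
    return trimmed


def maybe_trim_messages(messages: list[str], max_chars: int) -> list[str]:
    """Trim, but never hand back an empty list in place of the originals."""
    trimmed = trim_messages(messages, max_chars)
    return trimmed if trimmed else messages
```

The `len(trimmed) > 1` condition is what stops the loop from popping the last remaining message, even when that single message is still over budget.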

The Responses API requires the new v1 API endpoint. The old preview version (2025-03-01-preview) does not support the /responses endpoint, causing BadRequest 'API version not supported' errors at runtime. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

This reverts commit 1d86176.

…pletions API)

Adds a new AzureOpenAIChatClientWithRetry that wraps OpenAIChatCompletionClient (the /chat/completions endpoint) with the same 429-retry and context-trimming logic as the existing AzureOpenAIResponseClientWithRetry, then switches the default client registered in AgentFrameworkHelper and the per-thread client in OrchestratorBase to use it.

The /chat/completions endpoint works with the existing 2025-03-01-preview Azure OpenAI API version, so the v1 API-version bump (commit 1d86176) is no longer required and is reverted in the prior commit.

Mirrors the approach used in microsoft/content-processing-solution-accelerator#599.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…eter

OpenAIChatCompletionClient._inner_get_response is a SYNC method that returns either Awaitable[ChatResponse] (stream=False) or ResponseStream (stream=True), matching the OpenAIChatClient (Responses API) shape. The previous implementation used async def without a stream parameter, which caused the framework's streaming path to receive a coroutine instead of an AsyncIterable, raising:

'async for' requires an object with __aiter__ method, got coroutine

Mirror the existing AzureOpenAIResponseClientWithRetry pattern: sync _inner_get_response that branches on stream and delegates non-streaming calls to _non_streaming_with_retry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
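The sync-dispatch pattern can be reproduced with plain asyncio, independent of the framework. In the sketch below `ParentClient` and `RetryingClient` are hypothetical stand-ins (not agent-framework classes), the retried exception type is assumed to be `RuntimeError`, and the streaming branch simply passes through to the parent as the earlier commit describes.

```python
import asyncio
from collections.abc import AsyncIterator


class ParentClient:
    """Hypothetical parent: a plain def that returns two different async shapes."""

    def _inner_get_response(self, prompt: str, stream: bool = False):
        if stream:
            return self._stream(prompt)   # async iterable, consumed with `async for`
        return self._complete(prompt)     # coroutine, consumed with `await`

    async def _complete(self, prompt: str) -> str:
        return f"echo:{prompt}"

    async def _stream(self, prompt: str) -> AsyncIterator[str]:
        for token in prompt.split():
            yield token


class RetryingClient(ParentClient):
    # Must stay a regular def: an `async def` override would wrap the
    # streaming result in a coroutine and break `async for`.
    def _inner_get_response(self, prompt: str, stream: bool = False):
        if stream:
            return super()._inner_get_response(prompt, stream=True)
        return self._non_streaming_with_retry(prompt)

    async def _non_streaming_with_retry(self, prompt: str, attempts: int = 3) -> str:
        for attempt in range(attempts):
            try:
                return await super()._inner_get_response(prompt, stream=False)
            except RuntimeError:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the last error


async def demo() -> tuple[str, list[str]]:
    client = RetryingClient()
    full = await client._inner_get_response("hello world")
    chunks = [c async for c in client._inner_get_response("hello world", stream=True)]
    return full, chunks
```

Running `asyncio.run(demo())` exercises both branches: the non-streaming call is awaited, and the streaming call is iterated without ever being awaited.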

OpenAI's Chat Completions endpoint validates the message `name` field against the pattern `^[^\s<|\\/>]+$`. Our agents have display names with whitespace (e.g. `Chief Architect`, `AKS Expert`), which caused a 400 BadRequest after switching the default client to `AzureOpenAIChatClientWithRetry`. Add `_sanitize_author_name` / `_sanitize_author_names` helpers that replace runs of disallowed characters (whitespace, `<`, `|`, `\`, `/`, `>`) with a single underscore and strip leading/trailing underscores. Names that sanitize down to an empty string are dropped entirely so the field can be omitted from the request. The sanitizer is applied inside `AzureOpenAIChatClientWithRetry._inner_get_response` after context trimming (and again after the trim-fallback retry inside `_non_streaming_with_retry`) so the wire format passes validation while in-memory `Message` objects keep their original display names for orchestration logic. Originals are never mutated — modified messages are shallow-copied before the name is rewritten. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
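A minimal sketch of the sanitizing rule described above, using only the standard library. The function names drop the leading underscore and operate on plain dicts rather than framework `Message` objects; both choices are assumptions made for illustration.

```python
import re
from typing import Optional

# Characters rejected by the name pattern quoted above: whitespace, <, |, \, /, >
_DISALLOWED = re.compile(r"[\s<|\\/>]+")


def sanitize_author_name(name: Optional[str]) -> Optional[str]:
    """Replace runs of disallowed characters with "_"; return None if nothing is left."""
    if not name:
        return None
    cleaned = _DISALLOWED.sub("_", name).strip("_")
    return cleaned or None


def sanitize_author_names(messages: list[dict]) -> list[dict]:
    """Return messages with wire-safe names; the input dicts are never mutated."""
    result = []
    for message in messages:
        if "name" not in message:
            result.append(message)
            continue
        copy = dict(message)  # shallow copy before rewriting the name
        clean = sanitize_author_name(copy.pop("name"))
        if clean is not None:
            copy["name"] = clean  # otherwise the field is omitted entirely
        result.append(copy)
    return result
```

For example, `sanitize_author_name("Chief Architect")` gives `"Chief_Architect"`, while a name made up only of disallowed characters sanitizes to `None` and the field is left out of the copied message.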

The [AOAI_RETRY] empty messages list received warning fired on every turn in group-chat orchestration when the same speaker was selected twice in a row, flooding logs and giving the false impression of an error. This pattern is by design in agent-framework's GroupChatOrchestrator: _broadcast_messages_to_participants excludes the source executor, so when the orchestrator routes back to the same agent, its message cache is empty. The framework already emits its own "AgentExecutor ... Running agent with empty message cache" warning for this case. The actual API call is not empty -- the parent OpenAIChatCompletionClient._prepare_options prepends the agent's system instructions from options["instructions"] before sending. So demoting our duplicate warning to DEBUG removes the noise without hiding any real failure. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The agent_framework_orchestrations.GroupChatBuilder forces the Coordinator's response_format to AgentOrchestrationOutput (strict schema with fields next_speaker/reason/terminate). Our prompt asks for selected_participant/instruction/finish, but strict structured output overrides the prompt's field names.

Without aliases, ManagerSelectionResponse.model_validate() silently succeeded with all fields = None (extra=allow), which disabled:
- The 3-strike loop-detection streak counter (line 1019-1054)
- Coordinator-driven termination on finish=true (line 1065)
- _agent_invoked_at[selected] elapsed-time tracking (line 1098)

Use Pydantic AliasChoices so the model accepts BOTH naming conventions, restoring anti-loop and termination logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Production still hits 400 BadRequest on messages[N].name even though _inner_get_response runs _sanitize_author_names on incoming Messages. The framework's _prepare_options/_prepare_messages_for_openai layer or agent-internal compaction can materialize messages with author_name set AFTER our early sanitization, leaving the dict 'name' field unsanitized on the wire. Override _prepare_messages_for_openai (the parent method that builds the final OpenAI dict payload) to sanitize each dict's 'name' field as a last-mile pass. This is the single chokepoint guaranteed to be on every Chat Completions request, regardless of upstream message-construction path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When Coordinator keeps picking the same agent A and A keeps running, A's own completions were bumping _progress_counter. Loop detection compares the counter snapshot taken at the previous identical Coordinator pick against the current value; if it changed, the streak was reset to 1. So the 3-strike threshold was never reached and the Coordinator->A->A pattern ran until max_rounds. Now we only treat a non-Coordinator completion as 'progress' when the completing agent is different from the agent the Coordinator is currently latching onto (_last_coordinator_selection[0]). A different agent stepping in still resets the streak; A repeating itself does not. Adds two regression tests covering both cases. Also updates an existing termination test whose name described 'other agent makes progress' but actually used the same agent, hard-coding the buggy semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ute by capability

The Coordinator's valid_participants block was a bullet list of names only, so the LLM had no per-agent capability signal. Combined with a Coordinator prompt that names 'Chief Architect' frequently across phases 0/1/4/5/6, the model latched onto Chief Architect repeatedly and the conversation looped on the same agent.

This change populates agent_description on every Analysis participant (Chief Architect, AKS Expert, and the platform experts in platform_registry.json) and renders each description into the Coordinator's valid_participants list. The descriptions are also passed through AgentBuilder.create_agent_by_agentinfo's existing description= argument, so the framework's Agent.description field is no longer always None.

Scope: Analysis step only. design/yaml/documentation orchestrators are left for a follow-up after this change is validated in production.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Production run after the previous progress-counter fix (0da531f) STILL showed Chief Architect picked 6+ consecutive times.

Root cause: the loop detection key was (agent, instruction_text). The LLM-driven Coordinator varies its instruction on every pick ('list source blobs', 'read xyz.yaml', 'save analysis_result.md') while latching onto the same agent, so every selection_key was unique, the streak reset to 1 on every pick, and the 3-strike threshold was never reached.

Change: track only the agent name (lower-cased). The progress counter (now correct after 0da531f) already encodes 'no DIFFERENT agent ran in between', so 3 consecutive picks of the same agent with no other-agent progress is a strong, low-false-positive loop signal.

Adds a regression test that replays the production sequence (same agent, three different instruction strings) and verifies forced termination fires. The earlier tests for exact-match repeats and for B-resets-the-streak continue to pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
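The combined semantics of the two loop-detection fixes (progress counts only when a different agent completes; the streak key is the lower-cased agent name alone) can be sketched like this. `LoopDetector` and its method names are hypothetical and are not the orchestrator's actual attributes.

```python
from typing import Optional


class LoopDetector:
    """Counts consecutive Coordinator picks of one agent with no other-agent progress."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.progress = 0  # bumped only when a different agent completes
        self._last: Optional[tuple[str, int]] = None  # (agent key, progress snapshot)
        self._streak = 0

    def on_agent_completed(self, agent: str) -> None:
        latched = self._last[0] if self._last else None
        if agent.lower() != latched:
            self.progress += 1  # the latched agent repeating itself is not progress

    def on_coordinator_pick(self, agent: str, instruction: str = "") -> bool:
        """Return True when the same-agent streak reaches the threshold."""
        key = agent.lower()  # the instruction text is deliberately ignored
        if self._last == (key, self.progress):
            self._streak += 1
        else:
            self._streak = 1
        self._last = (key, self.progress)
        return self._streak >= self.threshold
```

Replaying the production sequence (three picks of Chief Architect with three different instructions, each followed by a Chief Architect completion) trips the detector on the third pick; a completion by any other agent in between resets the streak to 1.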

Production deployment of the agent-framework 1.3.0 upgrade surfaced a crash chain: Analysis "succeeded" with a self-contradictory result (result=True, is_hard_terminated=False, output=None), Design then crashed at `task_param.output.process_id`. The root cause is the ResultGenerator returning an empty shell when participants never produced useful content.

Fixes:
* groupchat_orchestrator.run_stream now validates ResultGenerator output before constructing OrchestrationResult. If the result is not hard terminated but carries no `output` / `termination_output` payload, the orchestrator now reports success=False with a descriptive error. This is generic across all four step models (Analysis uses `output`; Design/Convert/Documentation use `termination_output`).
* All four step executors gained a defense-in-depth guard that raises a clear `<Step>Executor failed: produced no <X>Output. Reason: ...` exception when the same incoherent shape is observed. This stops the broken value at the boundary instead of propagating it downstream.
* groupchat_orchestrator silent `except Exception: pass` around Coordinator JSON parsing replaced with `logger.debug(... exc_info=...)` so loop-detection failures become visible during debugging instead of being swallowed.

Tests:
* Updated each executor's existing soft-completion test to provide a valid output (previous setup encoded the broken shape we now reject).
* Added a new guard test per executor asserting the new exception fires for the incoherent (success=True + output=None + not hard-terminated) shape.
* Full unit suite: 829 passed (was 825; +4 new guard tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
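The executor-side guard might be sketched as below. `StepResult` is a hypothetical dataclass whose field names follow the commit message; the real step models, the exact exception type, and the wording after `Reason:` are not shown in this PR text and are assumptions here.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StepResult:
    """Hypothetical result shape; field names follow the commit message."""

    result: bool
    is_hard_terminated: bool = False
    output: Optional[Any] = None
    termination_output: Optional[Any] = None


def validate_step_result(step: str, res: StepResult) -> StepResult:
    """Reject 'success' results that are not hard-terminated yet carry no payload."""
    has_payload = res.output is not None or res.termination_output is not None
    if res.result and not res.is_hard_terminated and not has_payload:
        raise RuntimeError(
            f"{step}Executor failed: produced no {step}Output. "
            "Reason: result reported success without an output payload"
        )
    return res
```

Checking at the boundary means the next step never dereferences `output` on an empty shell; the failure is reported where the incoherent value is produced.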

Root cause of the "Runner did not converge after 100 iterations" production failure (and the Chief-Architect-only loop that preceded it): agent-framework 1.3.0 changed how AgentResponseUpdate is constructed. `map_chat_to_agent_update` (_types.py:2825-2837) now only sets `author_name` and leaves `agent_id` as None. Our orchestrator was reading `event.agent_id` exclusively, so every streaming update resolved to `agent_name=""`.

That silently broke:
* Loop detection (line 1080 `if agent_name == self.coordinator_name` never matched, so the streak counter never advanced and the 3x same-agent guard never fired). Production looped 100x on Chief Architect with zero detection.
* Coordinator termination signal extraction (`finish=true`, `instruction=complete`, blocking instructions) - same gated block.
* Manager-instruction parsing for the next participant.

The [MEMORY] logs continued to show real agent names ("Chief Architect") because `SharedMemoryContextProvider` reads the name from the agent's own context, not from the workflow event - which is why the regression was invisible from logs alone.

Fix: in `_handle_agent_update`, prefer `event.author_name` (which IS populated by 1.3.0's `map_chat_to_agent_update`) and fall back to `agent_id` only when author_name is missing, for backwards compat with older event shapes. Use `getattr` defensively so existing tests that construct SimpleNamespace events without author_name still work.

Tests:
* test_handle_agent_update_resolves_coordinator_via_author_name_when_agent_id_is_none - asserts the identity resolution itself
* test_loop_detection_fires_on_3_consecutive_coordinator_selections_via_handle_agent_update - end-to-end through the production code path: 3 identical Coordinator selections via _handle_agent_update must trip _forced_termination
* Both tests verified to FAIL without the fix (intentionally reverted to confirm) and PASS with the fix
* Full suite: 831 passed (was 829, +2 regression tests)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…rounds for af 1.3.0

In agent-framework 1.3.0, `workflow.run(stream=True)` only yields `WorkflowEvent` instances. `AgentResponseUpdate` is wrapped inside `event.data` for `type=="output"` events. The two types are unrelated (verified by MRO), so the previous `isinstance(event, AgentResponseUpdate)` gate from the b260107 era was permanently dead in 1.3.0.

As a result every orchestrator-side safety guard inside that branch silently no-opped:
* per-agent loop detection
* Coordinator finish=true detection
* max_rounds enforcement
* streaming callback dispatch
* manager-instruction extraction

That is why production runs hit the framework's own 100-iteration runner cap as `RuntimeError("Runner did not converge after 100 iterations")` even after the recent identity-resolution patch (which only touched code that never executed).

Three coordinated fixes:
1. Replace the dead `isinstance(event, AgentResponseUpdate)` gate with `isinstance(event, WorkflowEvent) and event.type == "output"` and inspect `event.data` / `event.executor_id` to distinguish per-participant streaming chunks (executor_id matches one of self.agents and data is AgentResponseUpdate) from the framework orchestrator's final output (list[Message] or custom result object).
2. Add `executor_id` parameter to `_handle_agent_update` so identity resolves from the WorkflowEvent wrapper's executor_id (always populated from `AgentExecutor.id` = the agent's name) first, then falls back to `event.author_name`, then legacy `event.agent_id`. Matches the approach already used by Content Processing Solution.
3. Pass `max_rounds=self.max_rounds` and `intermediate_outputs=True` to `GroupChatBuilder`:
   - `max_rounds` gives the framework itself a clean termination ceiling so even if our orchestrator-side guards miss, the workflow halts cleanly instead of crashing at the runner's 100-iteration cap.
   - `intermediate_outputs=True` is required for each participant's `yield_output(AgentResponseUpdate)` call to surface as a workflow `output` event. Without this, only the orchestrator's final yield reaches our streaming loop and the per-agent guards above never run.

Tests:
* Existing termination/loop-detection tests still pass (handler now has 3-tier identity resolution with backward-compat for `author_name`).
* Added `test_handle_agent_update_prefers_executor_id_over_author_name` to lock in the new precedence.
* Added `test_handle_agent_update_strips_executor_id_prefix` to cover the `groupchat_agent:Coordinator` framework prefix.
* Full suite: 833 passed (was 831; +2 new tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
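The three-tier identity resolution can be sketched without the framework by using `SimpleNamespace` events, as the existing tests are said to do. `resolve_agent_name` is a hypothetical helper, and the prefix handling (everything up to the first `:` is dropped) is an assumption based only on the `groupchat_agent:Coordinator` example.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_agent_name(event: Any, executor_id: Optional[str] = None) -> str:
    """Resolve the speaker: executor_id first, then author_name, then legacy agent_id."""
    if executor_id:
        # Assumed prefix form: "groupchat_agent:Coordinator" -> "Coordinator".
        return executor_id.split(":", 1)[-1]
    # getattr keeps older event shapes (without author_name) working.
    return getattr(event, "author_name", None) or getattr(event, "agent_id", None) or ""


# Example events built the same way a unit test might build them.
update_1_3_0 = SimpleNamespace(author_name="Chief Architect", agent_id=None)
legacy_update = SimpleNamespace(agent_id="AKS Expert")
```

Here `resolve_agent_name(update_1_3_0)` yields the author name even though `agent_id` is None, and passing `executor_id="groupchat_agent:Coordinator"` takes precedence over both event attributes.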

In agent-framework 1.3.0 the GroupChat orchestrator agent (Coordinator) is invoked directly inside the framework's internal _invoke_agent_helper (agent_framework_orchestrations/_group_chat.py:484) rather than through an AgentExecutor. The Coordinator therefore never surfaces as a workflow event, which makes our existing Coordinator-JSON-based loop detector in _complete_agent_response permanently dead in 1.3.0.

Symptom in production: workflow loops with the Coordinator latched onto the same participant (e.g., Chief Architect repeatedly asked to produce an Evidence Pack that never satisfies the next reviewer). The loop runs until the framework's max_rounds ceiling fires (~17 min at default 100) instead of being caught early.

Fix:
* Track participant turn completions from WorkflowEvent.executor_completed, the one observable signal that does NOT depend on Coordinator visibility (participants ARE wrapped in AgentExecutor and so do emit these events).
* Force-terminate (hard_loop) after 3 consecutive completions of the same participant.
* Force-terminate (hard_timeout) when total participant completions reach max_rounds; independent of len(agent_responses) which only grows on agent switch and so can never reach max_rounds during a same-participant loop.
* Flush per-participant streaming buffer on each executor_completed so back-to-back same-agent turns produce one AgentResponse per turn instead of accumulating across turns.
* Move forced-termination break check to top of the streaming loop so any branch (timeout, participant loop, Coordinator finish=true) takes effect on the very next event rather than waiting for the next output event.

Adds 3 regression tests covering the streak trigger, the alternation reset, and the round-budget enforcement. 836 tests pass (833 -> 836).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ed events" This reverts commit 7a0212f.

…p dead code

* Add a narrow logging.Filter on agent_framework._workflows._agent_executor that drops only the 'Running agent with empty message cache' message. This warning fires by design in GroupChat orchestration when the orchestrator routes back to the same speaker (broadcast cache is empty because _broadcast_messages_to_participants excludes the source executor). The framework's parent client prepends system instructions before the LLM call, so the API request still has content. Other warnings/errors from the same logger remain visible.
* Remove three lines of commented-out duplicate callback invocation in groupchat_orchestrator._complete_agent_response. The live callback handler is in the block directly above; the commented block was refactor debris.

No behavioural change. All 833 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
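A narrow filter of this kind needs only the standard `logging` module. The logger name and message text below come from the commit message; the class name `DropEmptyCacheWarning` and the substring match are assumptions about how the filter might be written.

```python
import logging


class DropEmptyCacheWarning(logging.Filter):
    """Drop one known-noisy message; let every other record through."""

    MESSAGE = "Running agent with empty message cache"

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record; True keeps it.
        return self.MESSAGE not in record.getMessage()


# Attach to the specific logger named in the commit message.
logging.getLogger("agent_framework._workflows._agent_executor").addFilter(
    DropEmptyCacheWarning()
)
```

A filter attached to a logger only applies to records logged on that logger itself, not to records propagated up from child loggers, which is why it is attached to the exact logger name rather than to a parent such as `agent_framework`.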

github-actions · 2026-06-13T11:56:20Z

Tests	Skipped	Failures	Errors	Time
833	0 💤	0 ❌	0 🔥	19.828s ⏱️

github-actions · 2026-06-13T11:56:32Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
TOTAL	3097	208	93%

report-only-changed-files is enabled. No files were changed during this commit :)

Tests	Skipped	Failures	Errors	Time
588	0 💤	0 ❌	0 🔥	24.179s ⏱️

Avijit-Microsoft and others added 28 commits May 21, 2026 11:48

chore: Dev merge to Main

2870a21

Merge pull request #254 from microsoft/dev

20d773e

chore: Dev merge to Main

chore: Dev merge to Main

5e7e467

Merge pull request #273 from microsoft/dev

868bd5d

chore: Dev merge to Main

increase the tokens

27261c7

Remove trailing blank line in azure_openai_response_retry.py (W391)

e364793

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Revert "Update AZURE_OPENAI_API_VERSION from 2025-03-01-preview to v1"

6938b2b

This reverts commit 1d86176.

Revert "fix(groupchat): detect participant loops via executor_complet…

03387c4

…ed events" This reverts commit 7a0212f.

Prachig-Microsoft temporarily deployed to production June 13, 2026 11:55 — with GitHub Actions Inactive

Prachig-Microsoft changed the title ~~Psl upgrade agent framework final~~ fix: Upgrade azure-ai-projects and agent-framework Jun 13, 2026

Prachig-Microsoft temporarily deployed to production June 13, 2026 11:59 — with GitHub Actions Inactive

Prachig-Microsoft temporarily deployed to production June 13, 2026 12:03 — with GitHub Actions Inactive

Prachig-Microsoft deployed to production June 13, 2026 12:03 — with GitHub Actions Active

Prachig-Microsoft temporarily deployed to production June 13, 2026 12:04 — with GitHub Actions Inactive

Prachig-Microsoft wants to merge 28 commits into dev from psl-upgrade-agent-framework-final

4 participants