fix: Upgrade azure-ai-projects and agent-framework #282

Draft
Prachig-Microsoft wants to merge 28 commits into dev from psl-upgrade-agent-framework-final

@Prachig-Microsoft

This pull request makes significant updates to the agent framework integration and developer experience, focusing on modernizing the agent builder interface, updating dependencies, and improving documentation for clarity and correctness. The most important changes are grouped below:

Agent Framework Refactor and Modernization:

  • Refactored agent_builder.py to use the new Agent class instead of ChatAgent, updated type hints to support new interfaces (e.g., SupportsChatGetResponse, HistoryProvider), and reworked option handling to use a ChatOptions object. Added logic to strip unsupported parameters for certain reasoning models and improved middleware/context provider handling for greater flexibility.

Dependency Updates:

  • Upgraded agent-framework to version 1.3.0 in src/processor/pyproject.toml and azure-ai-projects to 2.1.0 in infra/vscode_web/requirements.txt to ensure compatibility with the latest features and bug fixes.

SDK Usage Modernization:

  • Updated infra/vscode_web/codeSample.py to use the latest azure-ai-projects API patterns, including the new import for ListSortOrder, revised thread/message/run creation methods, and improved error handling and message printing logic.

Documentation Improvements:

  • Clarified the need for the --prerelease=allow flag in dependency installation instructions and provided more context about transitive pre-release dependencies in docs/LocalDevelopmentSetup.md.
  • Updated workflow construction instructions in docs/ProcessFrameworkGuide.md to reflect the new WorkflowBuilder API, and simplified the unit test command to remove unnecessary flags.

These changes collectively modernize the codebase, improve developer onboarding, and ensure compatibility with the latest SDK and agent framework features.

Purpose

  • ...

Does this introduce a breaking change?

  • Yes
  • No

Golden Path Validation

  • I have tested the primary workflows (the "golden path") to ensure they function correctly without errors.

Deployment Validation

  • I have validated the deployment process successfully and all services are running as expected with this change.

What to Check

Verify that the following are valid

  • ...

Other Information

Avijit-Microsoft and others added 28 commits May 21, 2026 11:48
chore: Dev merge to Main
chore: Dev merge to Main
- Update agent-framework from 1.0.0b260107 to 1.3.0 in pyproject.toml
- Update azure-ai-projects from 1.0.0b12 to 2.1.0 in requirements.txt
- Migrate ChatAgent to Agent (client=, default_options=ChatOptions)
- Migrate agent_framework.azure to agent_framework.openai module paths
- Migrate ChatMessage to Message with Content.from_text()
- Migrate Role enum to string literals
- Migrate AgentRunContext to AgentContext
- Migrate WorkflowBuilder to new API (start_executor=, add_chain)
- Migrate event handling from isinstance checks to WorkflowEvent.type
- Migrate GroupChatBuilder to agent_framework.orchestrations module
- Migrate ContextProvider to before_run/after_run interface
- Remove ToolProtocol (use Any), AgentProtocol (use SupportsAgentRun)
- Define ManagerSelectionResponse locally (removed from framework)
- Update MCP tool files for Agent import
- Update all unit tests for new APIs (812 tests passing)
- Update docs/ProcessFrameworkGuide.md with new WorkflowBuilder example
- Update docs/LocalDevelopmentSetup.md prerelease note
- Regenerate uv.lock

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AgentBuilder.with_context_providers() and with_middleware() accepted single
objects but passed them directly to Agent(), which expects Sequence types.
Now both methods auto-wrap single items into a list.

Also wrapped the call site in orchestrator_base.py for clarity.
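
The auto-wrapping described above can be sketched roughly as follows. This is a minimal illustration, not the code in agent_builder.py: `as_list` is a hypothetical helper name, and treating `None` as an empty list and strings as single items are assumptions.

```python
from collections.abc import Sequence
from typing import Any


def as_list(value: Any) -> list[Any]:
    """Return `value` as a list, wrapping a single object (hypothetical helper)."""
    if value is None:
        # Assumption: "nothing configured" maps to an empty list.
        return []
    # str/bytes are Sequences too, but should be treated as one item.
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
        return list(value)
    return [value]
```

With a helper like this, `with_middleware(single_item)` and `with_middleware([a, b])` would both hand a list to the `Agent` constructor.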

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The parent OpenAIChatClient._inner_get_response is a regular def that
returns ResponseStream (async iterable) when stream=True, or Awaitable
when stream=False. The override was async def, which always returned a
coroutine, breaking 'async for event in workflow.run(stream=True)'.

Refactored to:
- Regular def _inner_get_response dispatching stream vs non-stream
- _non_streaming_with_retry: async coroutine with retry + context-trim
- _streaming_with_retry: async generator with pre-first-chunk retry
- _maybe_trim_messages: shared context-trim helper

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…erator

The framework's BaseChatClient.get_response checks
isinstance(result, ResponseStream) for streaming responses. Our async
generator from _streaming_with_retry failed that check, causing the
framework to 'await' it — which fails with 'object async_generator
can't be used in await expression'.

Fix: for streaming, pass through to the parent's _inner_get_response
which returns a proper ResponseStream. Retry is preserved for
non-streaming calls. Removed unused _streaming_with_retry method.
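
A self-contained sketch of the resulting shape, for reviewers who want to see the dispatch in isolation. `FakeStream`, `ParentClient`, and `RetryClient` are stand-ins invented for this example (they are not framework classes), and the simulated `RuntimeError` only approximates a 429; the real client also performs context trimming, which is omitted here.

```python
import asyncio
from typing import Any, AsyncIterator


class FakeStream:
    """Stand-in for the framework's ResponseStream (hypothetical)."""

    def __init__(self, chunks: list[str]) -> None:
        self._chunks = chunks

    def __aiter__(self) -> AsyncIterator[str]:
        return self._iterate()

    async def _iterate(self) -> AsyncIterator[str]:
        for chunk in self._chunks:
            yield chunk


class ParentClient:
    """Stand-in parent: a regular def returning a stream or an awaitable."""

    def __init__(self, failures_before_success: int = 0) -> None:
        self.failures_left = failures_before_success

    def _inner_get_response(self, *, messages: list[Any], stream: bool = False) -> Any:
        if stream:
            return FakeStream(["a", "b"])
        return self._complete(messages)

    async def _complete(self, messages: list[Any]) -> str:
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RuntimeError("429 Too Many Requests (simulated)")
        return "done"


class RetryClient(ParentClient):
    max_attempts = 3

    def _inner_get_response(self, *, messages: list[Any], stream: bool = False) -> Any:
        # Regular def, not async def: the stream object is returned untouched,
        # so an isinstance check on the stream type still succeeds.
        if stream:
            return super()._inner_get_response(messages=messages, stream=True)
        # Non-streaming: hand back a coroutine that the caller awaits.
        return self._non_streaming_with_retry(messages)

    async def _non_streaming_with_retry(self, messages: list[Any]) -> Any:
        for attempt in range(1, self.max_attempts + 1):
            try:
                return await super()._inner_get_response(messages=messages, stream=False)
            except RuntimeError:
                if attempt == self.max_attempts:
                    raise
                await asyncio.sleep(0)  # placeholder for real backoff
```

The trade-off is that streaming calls get no retry in this shape; only the non-streaming path is wrapped.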

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- _trim_messages: keep at least 1 message (never pop to empty)
- _maybe_trim_messages: fall back to originals if trim produces empty
- _non_streaming_with_retry: re-raise if aggressive trim empties list
- _inner_get_response: log warning and use originals if messages empty
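
The first two guards above might look something like this. It is a sketch under assumptions: messages are plain strings, the budget is a character count, and trimming drops the oldest message first; none of these details are stated in this commit, and the function names only echo the private helpers.

```python
def trim_messages(messages: list[str], max_chars: int) -> list[str]:
    """Drop oldest messages until the total fits, but never pop to empty."""
    trimmed = list(messages)  # work on a copy; the caller's list is untouched
    while len(trimmed) > 1 and sum(len(m) for m in trimmed) > max_chars:
        trimmed.pop(0)
    return trimmed


def maybe_trim_messages(messages: list[str], max_chars: int) -> list[str]:
    """Fall back to the original messages if trimming produced nothing."""
    trimmed = trim_messages(messages, max_chars)
    return trimmed if trimmed else list(messages)
```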

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Responses API requires the new v1 API endpoint. The old preview
version (2025-03-01-preview) does not support the /responses endpoint,
causing BadRequest 'API version not supported' errors at runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pletions API)

Adds a new AzureOpenAIChatClientWithRetry that wraps OpenAIChatCompletionClient
(the /chat/completions endpoint) with the same 429-retry and context-trimming
logic as the existing AzureOpenAIResponseClientWithRetry, then switches the
default client registered in AgentFrameworkHelper and the per-thread client in
OrchestratorBase to use it.

The /chat/completions endpoint works with the existing 2025-03-01-preview
Azure OpenAI API version, so the v1 API-version bump (commit 1d86176) is no
longer required and is reverted in the prior commit.

Mirrors the approach used in microsoft/content-processing-solution-accelerator#599.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eter

OpenAIChatCompletionClient._inner_get_response is a SYNC method that returns
either Awaitable[ChatResponse] (stream=False) or ResponseStream (stream=True),
matching the OpenAIChatClient (Responses API) shape.

The previous implementation used async def without a stream parameter, which
caused the framework's streaming path to receive a coroutine instead of an
AsyncIterable, raising:

    'async for' requires an object with __aiter__ method, got coroutine

Mirror the existing AzureOpenAIResponseClientWithRetry pattern: sync _inner_get_response
that branches on stream and delegates non-streaming calls to _non_streaming_with_retry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OpenAI's Chat Completions endpoint validates the message `name` field against the pattern `^[^\s<|\\/>]+$`. Our agents have display names with whitespace (e.g. `Chief Architect`, `AKS Expert`), which caused a 400 BadRequest after switching the default client to `AzureOpenAIChatClientWithRetry`.

Add `_sanitize_author_name` / `_sanitize_author_names` helpers that replace runs of disallowed characters (whitespace, `<`, `|`, `\`, `/`, `>`) with a single underscore and strip leading/trailing underscores. Names that sanitize down to an empty string are dropped entirely so the field can be omitted from the request.

The sanitizer is applied inside `AzureOpenAIChatClientWithRetry._inner_get_response` after context trimming (and again after the trim-fallback retry inside `_non_streaming_with_retry`) so the wire format passes validation while in-memory `Message` objects keep their original display names for orchestration logic. Originals are never mutated — modified messages are shallow-copied before the name is rewritten.
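
For reference, the sanitization rule described above can be approximated as below. This is a sketch, not the committed helper: the function name drops the leading underscore, and returning `None` for "omit the field" is an assumption about how the dropped name is represented.

```python
import re
from typing import Optional

# Characters rejected by the `^[^\s<|\\/>]+$` pattern quoted above.
_DISALLOWED_RUN = re.compile(r"[\s<|\\/>]+")


def sanitize_author_name(name: Optional[str]) -> Optional[str]:
    """Replace runs of disallowed characters with "_" and trim stray underscores."""
    if not name:
        return None
    cleaned = _DISALLOWED_RUN.sub("_", name).strip("_")
    # Assumption: None signals that `name` should be left out of the request.
    return cleaned or None
```

For example, `Chief Architect` becomes `Chief_Architect`, while a name made up only of disallowed characters yields `None`.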

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The "[AOAI_RETRY] empty messages list received" warning fired on every
turn in group-chat orchestration when the same speaker was selected
twice in a row, flooding logs and giving the false impression of an
error.

This pattern is by design in agent-framework's GroupChatOrchestrator:
_broadcast_messages_to_participants excludes the source executor, so
when the orchestrator routes back to the same agent, its message
cache is empty. The framework already emits its own
"AgentExecutor ... Running agent with empty message cache" warning
for this case.

The actual API call is not empty -- the parent
OpenAIChatCompletionClient._prepare_options prepends the agent's
system instructions from options["instructions"] before sending. So
demoting our duplicate warning to DEBUG removes the noise without
hiding any real failure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The agent_framework_orchestrations.GroupChatBuilder forces the Coordinator's response_format to AgentOrchestrationOutput (strict schema with fields next_speaker/reason/terminate). Our prompt asks for selected_participant/instruction/finish, but strict structured output overrides the prompt's field names.

Without aliases, ManagerSelectionResponse.model_validate() silently succeeded with all fields = None (extra=allow), which disabled:

  - The 3-strike loop-detection streak counter (lines 1019-1054)
  - Coordinator-driven termination on finish=true (line 1065)
  - _agent_invoked_at[selected] elapsed-time tracking (line 1098)

Use Pydantic AliasChoices so the model accepts BOTH naming conventions, restoring anti-loop and termination logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Production still hits 400 BadRequest on messages[N].name even though _inner_get_response runs _sanitize_author_names on incoming Messages. The framework's _prepare_options/_prepare_messages_for_openai layer or agent-internal compaction can materialize messages with author_name set AFTER our early sanitization, leaving the dict 'name' field unsanitized on the wire.

Override _prepare_messages_for_openai (the parent method that builds the final OpenAI dict payload) to sanitize each dict's 'name' field as a last-mile pass. This is the single chokepoint guaranteed to be on every Chat Completions request, regardless of upstream message-construction path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When Coordinator keeps picking the same agent A and A keeps running, A's
own completions were bumping _progress_counter. Loop detection compares
the counter snapshot taken at the previous identical Coordinator pick
against the current value; if it changed, the streak was reset to 1. So
the 3-strike threshold was never reached and the Coordinator->A->A
pattern ran until max_rounds.

Now we only treat a non-Coordinator completion as 'progress' when the
completing agent is different from the agent the Coordinator is
currently latching onto (_last_coordinator_selection[0]). A different
agent stepping in still resets the streak; A repeating itself does not.

Adds two regression tests covering both cases. Also updates an existing
termination test whose name described 'other agent makes progress' but
actually used the same agent, hard-coding the buggy semantics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ute by capability

The Coordinator's valid_participants block was a bullet list of names only,
so the LLM had no per-agent capability signal. Combined with a Coordinator
prompt that names 'Chief Architect' frequently across phases 0/1/4/5/6,
the model latched onto Chief Architect repeatedly and the conversation
looped on the same agent.

This change populates agent_description on every Analysis participant
(Chief Architect, AKS Expert, and the platform experts in
platform_registry.json) and renders each description into the Coordinator's
valid_participants list. The descriptions are also passed through
AgentBuilder.create_agent_by_agentinfo's existing description= argument,
so the framework's Agent.description field is no longer always None.

Scope: Analysis step only. design/yaml/documentation orchestrators are
left for a follow-up after this change is validated in production.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Production run after the previous progress-counter fix (0da531f) STILL
showed Chief Architect picked 6+ consecutive times. Root cause: the
loop detection key was (agent, instruction_text). The LLM-driven
Coordinator varies its instruction on every pick ('list source blobs',
'read xyz.yaml', 'save analysis_result.md') while latching onto the
same agent — so every selection_key was unique, the streak reset to 1
on every pick, and the 3-strike threshold was never reached.

Change: track only the agent name (lower-cased). The progress counter
(now correct after 0da531f) already encodes 'no DIFFERENT agent ran in
between', so 3 consecutive picks of the same agent with no other-agent
progress is a strong, low-false-positive loop signal.

Adds a regression test that replays the production sequence (same agent,
three different instruction strings) and verifies forced termination
fires. The earlier tests for exact-match repeats and for B-resets-the-
streak continue to pass.
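
Taken together with the earlier progress-counter fix, the detection logic can be sketched as follows. `LoopDetector` and its method names are hypothetical and exist only for this illustration; the real orchestrator keeps this state on its own attributes (for example `_progress_counter` and `_last_coordinator_selection`).

```python
from typing import Optional


class LoopDetector:
    """Illustrative 3-strike detector keyed on agent name only."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.progress_counter = 0
        self._last_agent: Optional[str] = None
        self._last_progress = 0
        self._streak = 0

    def record_completion(self, agent: str) -> None:
        # Only a DIFFERENT agent finishing a turn counts as progress;
        # the latched agent repeating itself does not.
        if agent.lower() != self._last_agent:
            self.progress_counter += 1

    def record_selection(self, agent: str, instruction: str = "") -> bool:
        """Register a Coordinator pick; return True once the streak hits the threshold."""
        key = agent.lower()  # the instruction text is deliberately ignored
        if key == self._last_agent and self.progress_counter == self._last_progress:
            self._streak += 1
        else:
            self._streak = 1
        self._last_agent = key
        self._last_progress = self.progress_counter
        return self._streak >= self.threshold
```

Replaying the production sequence (the same agent picked three times with three different instructions and no other agent completing in between) trips the detector on the third pick.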

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Production deployment of the agent-framework 1.3.0 upgrade surfaced a
crash chain: Analysis "succeeded" with a self-contradictory result
(result=True, is_hard_terminated=False, output=None), Design then
crashed at `task_param.output.process_id`. The root cause is the
ResultGenerator returning an empty shell when participants never
produced useful content.

Fixes:

* groupchat_orchestrator.run_stream now validates ResultGenerator output
  before constructing OrchestrationResult. If the result is not hard
  terminated but carries no `output` / `termination_output` payload, the
  orchestrator now reports success=False with a descriptive error. This
  is generic across all four step models (Analysis uses `output`;
  Design/Convert/Documentation use `termination_output`).
* All four step executors gained a defense-in-depth guard that raises a
  clear `<Step>Executor failed: produced no <X>Output. Reason: ...`
  exception when the same incoherent shape is observed. This stops the
  broken value at the boundary instead of propagating it downstream.
* The silent `except Exception: pass` around Coordinator JSON parsing in
  groupchat_orchestrator is replaced with `logger.debug(... exc_info=...)`
  so loop-detection failures become visible during debugging instead of
  being swallowed.

Tests:

* Updated each executor's existing soft-completion test to provide a
  valid output (previous setup encoded the broken shape we now reject).
* Added a new guard test per executor asserting the new exception fires
  for the incoherent (success=True + output=None + not hard-terminated)
  shape.
* Full unit suite: 829 passed (was 825; +4 new guard tests).
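
The executor-side guard could be sketched along these lines. `StepResult` is a simplified stand-in for the real result models, `require_payload` is a hypothetical name, and raising `RuntimeError` is an assumption; only the field names and the message format come from the description above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StepResult:
    """Simplified stand-in for the step result models (hypothetical)."""

    result: bool
    is_hard_terminated: bool = False
    output: Any = None
    termination_output: Any = None
    reason: str = ""


def require_payload(step_name: str, step_result: StepResult) -> Any:
    """Return the step payload, or raise if the result is an empty shell."""
    has_payload = (
        step_result.output is not None
        or step_result.termination_output is not None
    )
    if not step_result.is_hard_terminated and not has_payload:
        raise RuntimeError(
            f"{step_name}Executor failed: produced no {step_name}Output. "
            f"Reason: {step_result.reason or 'unknown'}"
        )
    if step_result.output is not None:
        return step_result.output
    return step_result.termination_output
```

Failing at this boundary means a downstream access such as `task_param.output.process_id` never sees a `None` output.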

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause of the "Runner did not converge after 100 iterations"
production failure (and the Chief-Architect-only loop that preceded it):
agent-framework 1.3.0 changed how AgentResponseUpdate is constructed.
`map_chat_to_agent_update` (_types.py:2825-2837) now only sets
`author_name` and leaves `agent_id` as None.

Our orchestrator was reading `event.agent_id` exclusively, so every
streaming update resolved to `agent_name=""`. That silently broke:

  * Loop detection (line 1080 `if agent_name == self.coordinator_name`
    never matched, so the streak counter never advanced and the 3x
    same-agent guard never fired). Production looped 100x on Chief
    Architect with zero detection.
  * Coordinator termination signal extraction (`finish=true`,
    `instruction=complete`, blocking instructions) - same gated block.
  * Manager-instruction parsing for the next participant.

The [MEMORY] logs continued to show real agent names ("Chief Architect")
because `SharedMemoryContextProvider` reads the name from the agent's
own context, not from the workflow event - which is why the regression
was invisible from logs alone.

Fix: in `_handle_agent_update`, prefer `event.author_name` (which IS
populated by 1.3.0's `map_chat_to_agent_update`) and fall back to
`agent_id` only when author_name is missing, for backwards compat with
older event shapes. Use `getattr` defensively so existing tests that
construct SimpleNamespace events without author_name still work.

Tests:

* test_handle_agent_update_resolves_coordinator_via_author_name_when_agent_id_is_none
  - asserts the identity resolution itself
* test_loop_detection_fires_on_3_consecutive_coordinator_selections_via_handle_agent_update
  - end-to-end through the production code path: 3 identical Coordinator
    selections via _handle_agent_update must trip _forced_termination
* Both tests verified to FAIL without the fix (intentionally reverted to
  confirm) and PASS with the fix
* Full suite: 831 passed (was 829, +2 regression tests)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rounds for af 1.3.0

In agent-framework 1.3.0, `workflow.run(stream=True)` only yields
`WorkflowEvent` instances. `AgentResponseUpdate` is wrapped inside
`event.data` for `type=="output"` events. The two types are unrelated
(verified by MRO), so the previous `isinstance(event, AgentResponseUpdate)`
gate from the b260107 era was permanently dead in 1.3.0. As a result every
orchestrator-side safety guard inside that branch silently no-opped:

* per-agent loop detection
* Coordinator finish=true detection
* max_rounds enforcement
* streaming callback dispatch
* manager-instruction extraction

That is why production runs hit the framework's own 100-iteration runner
cap as `RuntimeError("Runner did not converge after 100 iterations")`
even after the recent identity-resolution patch (which only touched code
that never executed).

Three coordinated fixes:

1. Replace the dead `isinstance(event, AgentResponseUpdate)` gate with
   `isinstance(event, WorkflowEvent) and event.type == "output"` and
   inspect `event.data` / `event.executor_id` to distinguish per-
   participant streaming chunks (executor_id matches one of self.agents
   and data is AgentResponseUpdate) from the framework orchestrator's
   final output (list[Message] or custom result object).

2. Add `executor_id` parameter to `_handle_agent_update` so identity
   resolves from the WorkflowEvent wrapper's executor_id (always populated
   from `AgentExecutor.id` = the agent's name) first, then falls back
   to `event.author_name`, then legacy `event.agent_id`. Matches the
   approach already used by Content Processing Solution.

3. Pass `max_rounds=self.max_rounds` and `intermediate_outputs=True`
   to `GroupChatBuilder`:
   - `max_rounds` gives the framework itself a clean termination
     ceiling so even if our orchestrator-side guards miss, the workflow
     halts cleanly instead of crashing at the runner's 100-iteration cap.
   - `intermediate_outputs=True` is required for each participant's
     `yield_output(AgentResponseUpdate)` call to surface as a workflow
     `output` event. Without this, only the orchestrator's final yield
     reaches our streaming loop and the per-agent guards above never run.

Tests:
* Existing termination/loop-detection tests still pass (handler now has
  3-tier identity resolution with backward-compat for `author_name`).
* Added `test_handle_agent_update_prefers_executor_id_over_author_name`
  to lock in the new precedence.
* Added `test_handle_agent_update_strips_executor_id_prefix` to cover
  the `groupchat_agent:Coordinator` framework prefix.
* Full suite: 833 passed (was 831; +2 new tests).
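
The three-tier identity resolution from fix 2 can be illustrated with a small standalone function. `resolve_agent_name` is a hypothetical name, and stripping only the literal `groupchat_agent:` prefix is an assumption based on the test description above; the real handler may normalize differently.

```python
from types import SimpleNamespace
from typing import Any, Optional

# Assumption: this is the only framework prefix that needs stripping.
_EXECUTOR_PREFIX = "groupchat_agent:"


def resolve_agent_name(event: Any, executor_id: Optional[str] = None) -> str:
    """Resolve the speaking agent: executor_id, then author_name, then agent_id."""
    if executor_id:
        return executor_id.removeprefix(_EXECUTOR_PREFIX)
    # getattr keeps SimpleNamespace-style test events without these fields working.
    for attr in ("author_name", "agent_id"):
        value = getattr(event, attr, None)
        if value:
            return str(value)
    return ""


# Usage: the wrapper's executor_id wins over the update's author_name.
update = SimpleNamespace(author_name="Chief Architect", agent_id=None)
name = resolve_agent_name(update, executor_id="groupchat_agent:Coordinator")
```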

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In agent-framework 1.3.0 the GroupChat orchestrator agent (Coordinator)
is invoked directly inside the framework's internal _invoke_agent_helper
(agent_framework_orchestrations/_group_chat.py:484) rather than through
an AgentExecutor. The Coordinator therefore never surfaces as a workflow
event, which makes our existing Coordinator-JSON-based loop detector in
_complete_agent_response permanently dead in 1.3.0.

Symptom in production: workflow loops with the Coordinator latched onto
the same participant (e.g., Chief Architect repeatedly asked to produce
an Evidence Pack that never satisfies the next reviewer). The loop runs
until the framework's max_rounds ceiling fires (~17 min at default 100)
instead of being caught early.

Fix:
* Track participant turn completions from WorkflowEvent.executor_completed,
  the one observable signal that does NOT depend on Coordinator visibility
  (participants ARE wrapped in AgentExecutor and so do emit these events).
* Force-terminate (hard_loop) after 3 consecutive completions of the same
  participant.
* Force-terminate (hard_timeout) when total participant completions reach
  max_rounds; independent of len(agent_responses) which only grows on
  agent switch and so can never reach max_rounds during a same-participant
  loop.
* Flush per-participant streaming buffer on each executor_completed so
  back-to-back same-agent turns produce one AgentResponse per turn instead
  of accumulating across turns.
* Move forced-termination break check to top of the streaming loop so any
  branch (timeout, participant loop, Coordinator finish=true) takes effect
  on the very next event rather than waiting for the next output event.

Adds 3 regression tests covering the streak trigger, the alternation
reset, and the round-budget enforcement. 836 tests pass (833 -> 836).
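
A rough sketch of the completion-based tracking, assuming each `executor_completed` event carries the participant's name. `TurnTracker` is a hypothetical class written for this example; the `hard_loop` and `hard_timeout` strings mirror the termination reasons named above.

```python
from typing import Optional


class TurnTracker:
    """Counts participant turn completions (illustrative only)."""

    def __init__(self, max_rounds: int, streak_limit: int = 3) -> None:
        self.max_rounds = max_rounds
        self.streak_limit = streak_limit
        self.total_completions = 0
        self._last_participant: Optional[str] = None
        self._streak = 0

    def on_executor_completed(self, participant: str) -> Optional[str]:
        """Return a termination reason, or None to keep the workflow running."""
        self.total_completions += 1
        if participant == self._last_participant:
            self._streak += 1
        else:
            self._last_participant = participant
            self._streak = 1
        if self._streak >= self.streak_limit:
            return "hard_loop"
        # Counted per completion, so a same-participant loop still hits the budget.
        if self.total_completions >= self.max_rounds:
            return "hard_timeout"
        return None
```

The three cases below correspond to the regression tests mentioned above: the streak trigger, the alternation reset, and the round budget.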

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…p dead code

* Add a narrow logging.Filter on agent_framework._workflows._agent_executor
  that drops only the 'Running agent with empty message cache' message.
  This warning fires by design in GroupChat orchestration when the orchestrator
  routes back to the same speaker (broadcast cache is empty because
  _broadcast_messages_to_participants excludes the source executor). The
  framework's parent client prepends system instructions before the LLM call,
  so the API request still has content. Other warnings/errors from the same
  logger remain visible.

* Remove three lines of commented-out duplicate callback invocation in
  groupchat_orchestrator._complete_agent_response. The live callback handler
  is in the block directly above; the commented block was refactor debris.

No behavioural change. All 833 tests pass.
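
For illustration, a narrow filter of this kind can be written with the standard `logging` module. The class name is hypothetical; the logger name and message text are the ones quoted above.

```python
import logging


class DropEmptyCacheWarning(logging.Filter):
    """Drop only the by-design 'empty message cache' record."""

    _NEEDLE = "Running agent with empty message cache"

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record; everything else passes through.
        return self._NEEDLE not in record.getMessage()


logging.getLogger("agent_framework._workflows._agent_executor").addFilter(
    DropEmptyCacheWarning()
)
```

Because the filter is attached to that one logger and matches one message, other warnings and errors from the same module are unaffected.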

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

Coverage

Tests Skipped Failures Errors Time
833 0 💤 0 ❌ 0 🔥 19.828s ⏱️

@github-actions

Coverage

Coverage Report
File Stmts Miss Cover Missing
TOTAL 3097 208 93%
report-only-changed-files is enabled. No files were changed during this commit :)

Tests Skipped Failures Errors Time
588 0 💤 0 ❌ 0 🔥 24.179s ⏱️

@Prachig-Microsoft changed the title from "Psl upgrade agent framework final" to "fix: Upgrade azure-ai-projects and agent-framework" Jun 13, 2026