Make HF generate work on transformers v4.57 and v5 #536

Merged
jlamypoirier merged 7 commits into main from jlp_modernize-hf-generate-wrapper on Jun 11, 2026
@jlamypoirier jlamypoirier commented Jun 8, 2026

Collaborator

What

Makes HuggingfaceGPTModelForCausalLM.generate() run and match HF greedy output on both transformers v4.57 (the declared floor) and v5, and enables the generate model-testing group for the models where it now passes.

Changes

can_generate() -> True on the wrapper base. transformers v5's PreTrainedModel.can_generate walks __bases__ by name and stops at any base whose name contains "PreTrainedModel"; our intermediate base hid the GenerationMixin inheritance, so the check returned False and generate() died with '... has no attribute generation_config'. The override is unconditional, so it is correct on both transformers majors.
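
The failure mode can be reproduced with a toy hierarchy. This is a minimal sketch that paraphrases the check as described above; it is not the transformers source, and every class name in it (`ToyPreTrainedModel`, `WrapperPreTrainedModel`, and so on) is hypothetical.

```python
class GenerationMixin:
    """Toy stand-in for transformers' GenerationMixin."""


class ToyPreTrainedModel:
    """Toy stand-in; the check paraphrases the PR text, not the library source."""

    @classmethod
    def can_generate(cls) -> bool:
        # Direct inheritance from GenerationMixin is detected by name.
        if any(base.__name__ == "GenerationMixin" for base in cls.__bases__):
            return True
        for base in cls.__bases__:
            # The walk does not look past a base whose name contains "PreTrainedModel".
            if "PreTrainedModel" in base.__name__:
                continue
            if hasattr(base, "can_generate") and base.can_generate():
                return True
        return False


# Hypothetical intermediate base: it mixes in GenerationMixin, but its own name
# contains "PreTrainedModel", so the name-based walk never reaches the mixin.
class WrapperPreTrainedModel(ToyPreTrainedModel, GenerationMixin):
    pass


class BrokenWrapperForCausalLM(WrapperPreTrainedModel):
    pass


# Same hierarchy with an unconditional override on the wrapper base.
class FixedWrapperPreTrainedModel(ToyPreTrainedModel, GenerationMixin):
    @classmethod
    def can_generate(cls) -> bool:
        return True


class FixedWrapperForCausalLM(FixedWrapperPreTrainedModel):
    pass


broken = BrokenWrapperForCausalLM.can_generate()
fixed = FixedWrapperForCausalLM.can_generate()
inherits_mixin = issubclass(BrokenWrapperForCausalLM, GenerationMixin)
```

In this toy, the broken wrapper really does inherit the mixin, yet the name-based check reports that it cannot generate; the override sidesteps the walk entirely.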

inner_forward absorbs cache_position (and other generate plumbing) via **kwargs. On v4.57 generate passes cache_position to forward; v5 filters it out. Ignoring it is correct on the use_cache=False path — positions are reconstructed from attention_mask, and full logits are computed with the last position selected downstream.
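
A rough sketch of the idea in plain Python, using lists instead of tensors. The signature and the exact position formula are assumptions for illustration (a running count of unmasked tokens, with padded slots set to 0); only the `**kwargs` absorption and the reconstruction from `attention_mask` come from the description above.

```python
def positions_from_mask(attention_mask):
    """Rebuild per-token positions from a 0/1 mask (assumed convention)."""
    positions, seen = [], 0
    for m in attention_mask:
        # Padded slots get a placeholder position of 0.
        positions.append(seen if m else 0)
        seen += m
    return positions


def inner_forward(input_ids, attention_mask=None, use_cache=False, **kwargs):
    """Hypothetical forward: version-dependent extras land in kwargs and are ignored."""
    if attention_mask is None:
        attention_mask = [1] * len(input_ids)
    return {
        "positions": positions_from_mask(attention_mask),
        "ignored": sorted(kwargs),
    }


ids = [0, 0, 11, 12, 13]
mask = [0, 0, 1, 1, 1]
# A v4.57-style call passes cache_position; a v5-style call does not.
out_v4 = inner_forward(ids, attention_mask=mask, cache_position=[0, 1, 2, 3, 4])
out_v5 = inner_forward(ids, attention_mask=mask)
```

Both calls yield the same positions, which is why dropping `cache_position` is harmless when no cache is used.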

The wrapper honors the source HF config's generation token ids. Fast-LLM's import drops bos/eos/pad (generation metadata, not architecture), so generate never stopped at EOS. from_pretrained now applies them from the source HF config to generation_config, mirroring AutoModelForCausalLM.from_pretrained. Exposed as _apply_generation_token_ids so manually-constructed wrappers can opt in. Native Fast-LLM checkpoints (no token ids) keep the defaults.
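
The copy step might look roughly like the sketch below. The function name and signature are hypothetical (the real `_apply_generation_token_ids` is not shown here), and `SimpleNamespace` objects stand in for the HF config and `generation_config`.

```python
from types import SimpleNamespace

TOKEN_ID_FIELDS = ("bos_token_id", "eos_token_id", "pad_token_id")


def apply_generation_token_ids(generation_config, source_config):
    """Copy any token ids the source config defines; leave the rest at their defaults."""
    for name in TOKEN_ID_FIELDS:
        value = getattr(source_config, name, None)
        if value is not None:
            setattr(generation_config, name, value)
    return generation_config


# Imported HF checkpoint: the source config carries token ids.
hf_source = SimpleNamespace(bos_token_id=1, eos_token_id=2, pad_token_id=None)
gen_cfg = SimpleNamespace(bos_token_id=None, eos_token_id=None, pad_token_id=None)
apply_generation_token_ids(gen_cfg, hf_source)

# Native checkpoint: no token ids, so the defaults are kept.
native_source = SimpleNamespace()
native_cfg = SimpleNamespace(bos_token_id=None, eos_token_id=None, pad_token_id=None)
apply_generation_token_ids(native_cfg, native_source)
```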

Prediction-head logits no longer leak into the returned hidden states. inner_forward popped only the main head's logits, so multi-token-prediction heads' logits leaked into output_hidden_states. All heads' logits are now popped (extra ones discarded unless stacked), so the hidden-states count is num_blocks + prediction_heads for any head configuration.
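
To illustrate the leak, here is a toy output namespace. The key names and the make-up of the namespace are assumptions; only the resulting count, num_blocks + prediction_heads, is taken from the description above.

```python
def make_namespace(num_blocks, prediction_heads):
    """Toy namespace: hidden entries plus one logits entry per head (hypothetical keys)."""
    ns = {f"hidden_{i}": f"h{i}" for i in range(num_blocks + prediction_heads)}
    for i in range(prediction_heads):
        ns[f"head_{i}.logits"] = f"logits{i}"
    return ns


def split_old(ns):
    """Old behavior: only the main head's logits are popped."""
    logits = ns.pop("head_0.logits")
    return logits, list(ns.values())


def split_new(ns, prediction_heads, stack_heads=False):
    """New behavior: every head's logits are popped; extras are dropped unless stacked."""
    all_logits = [ns.pop(f"head_{i}.logits") for i in range(prediction_heads)]
    logits = all_logits if stack_heads else all_logits[0]
    return logits, list(ns.values())


num_blocks, prediction_heads = 3, 2
_, hidden_old = split_old(make_namespace(num_blocks, prediction_heads))
_, hidden_new = split_new(make_namespace(num_blocks, prediction_heads), prediction_heads)
```

With two heads, the old split returns one entry too many because the second head's logits are still in the namespace.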

Allowlist is_llama_config in the HF config coverage check. transformers v4 LlamaConfig carries an is_llama_config marker (dropped in v5) that Fast-LLM doesn't consume and a bare PretrainedConfig omits, so the import-boundary check rejected it — Fast-LLM could not import a real transformers-4.x Llama checkpoint. Verified on 4.57.5 that all supported converters (llama/qwen2/mistral/mixtral/mtp_llama) now report no unconsumed config keys.
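
The shape of such a coverage check can be sketched as follows. The function, its arguments and the toy config are all hypothetical; the sketch only assumes the check flags keys that are neither consumed by the converter, present on a bare config, nor allowlisted.

```python
def unconsumed_keys(config, consumed, bare_config, allowlist):
    """Return config keys that nothing accounts for."""
    return {
        key
        for key in config
        if key not in consumed and key not in bare_config and key not in allowlist
    }


# Toy v4-style Llama config: one architecture key, one generic key, one marker.
config = {"hidden_size": 64, "return_dict": True, "is_llama_config": True}
consumed = {"hidden_size"}          # keys the converter reads
bare_config = {"return_dict": True}  # keys a bare PretrainedConfig would also carry

before = unconsumed_keys(config, consumed, bare_config, allowlist=frozenset())
after = unconsumed_keys(config, consumed, bare_config, allowlist=frozenset({"is_llama_config"}))
```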

Generate test group enabled for llama, mistral, mixtral, mtp_llama (verified on transformers 4.57 GPU). The CUDA-bound generate tests gain @requires_cuda so they skip on the CPU-only CI runner; test_export_for_generate stays CPU-runnable. Also fixes stale test code: vocab_size moved under embeddings, hidden-states count generalized, and test_export_for_generate uses the current DistributedTestingConfig signature.

lm_eval split into its own lm_eval testing group (it shared generate), so enabling generate doesn't pull the lm_eval tests — broken on transformers v5 — into CI.

starcoder_2 generate/convert demoted. It has no HF checkpoint format, so the export-based generate tests can't run (generate → not_implemented; the previous broken was misleading), and its native conversion round-trip is redundant with other models (convert → unimportant).

Left broken, with reasons

  • qwen_2: matches in fp32 but a bf16/flash near-tie argmax flips within the compared horizon (numerical).
  • diffusion_llama / dream: bidirectional diffusion decoding.
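
The qwen_2 near-tie flip can be illustrated without a GPU. The sketch below emulates bf16 by truncating a float32 to its top 16 bits (an assumption for simplicity; real hardware typically rounds to nearest), and the logit values are made up.

```python
import struct


def to_bf16(x):
    """Emulate bf16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def argmax(values):
    """Index of the largest value; the first index wins on ties."""
    return max(range(len(values)), key=values.__getitem__)


# Two logits that differ by less than the bf16 spacing near 10 (0.0625).
logits_fp32 = [10.001, 10.002]
logits_bf16 = [to_bf16(v) for v in logits_fp32]

token_fp32 = argmax(logits_fp32)
token_bf16 = argmax(logits_bf16)
```

In full precision the second token wins; after truncation both logits collapse to the same value and the first token wins, so greedy decoding diverges from that step on.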

Out of scope (follow-up)

  • lm_eval on transformers v5: import lm_eval (0.4.9) fails at class-body time (AutoModelForVision2Seq renamed to AutoModelForImageTextToText).

🤖 Generated with Claude Code

@jlamypoirier jlamypoirier force-pushed the jlp_modernize-hf-generate-wrapper branch from 2dbf360 to 892ba44 Compare June 8, 2026 20:23
Override `can_generate() -> True` on the HF inference wrapper base. v5's
`PreTrainedModel.can_generate` walks `__bases__` by name and stops at any base
whose name contains "PreTrainedModel"; the intermediate base hides the
`GenerationMixin` inheritance, so the check returned False and `generate()`
died with "no attribute 'generation_config'". The override is unconditional,
correct on both transformers majors.

Also fix stale assertions in the generate tests: `vocab_size` now lives under
`base_model.embeddings`, and the returned hidden-states count is
`num_blocks + 1`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jlamypoirier jlamypoirier force-pushed the jlp_modernize-hf-generate-wrapper branch from 892ba44 to f72d5d1 Compare June 9, 2026 22:01
@jlamypoirier jlamypoirier changed the title from "Make generate and lm_eval run on transformers v5" to "Make HF generate work on transformers v4.57 and v5" Jun 9, 2026
jlamypoirier and others added 4 commits June 10, 2026 14:36
…harness

Three fixes needed to make the per-model generate tests run and match HF on
transformers 4.57:

- `inner_forward` absorbs `cache_position` (and other version-dependent generate
  plumbing) via `**kwargs`. v4.57's `generate` passes `cache_position` to forward;
  v5 filters it out. Ignoring it is correct on the `use_cache=False` path.
- The HF wrapper's `from_pretrained` applies the source HF config's `bos/eos/pad`
  token ids to `generation_config` (Fast-LLM's import drops them as non-architecture
  metadata), so `generate` stops at EOS like `AutoModelForCausalLM`. Exposed as
  `_apply_generation_token_ids` so manually-constructed wrappers can opt in.
- `test_export_for_generate` now passes a `DistributedTestingConfig` (the helper's
  current signature) instead of a positional list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the `generate` model-testing group from `broken` to `normal` for the models
where it now passes end-to-end (verified on transformers 4.57 GPU). The CUDA-bound
generate tests gain `@requires_cuda` so they skip on the CPU-only CI runner instead
of crashing; `test_export_for_generate` stays CPU-runnable as the dependency root.

Models left `broken`, with reasons: qwen_2 (bf16/flash near-tie argmax flip),
mtp_llama (forward hidden-states count not modeled for multi-head), starcoder_2
(no converter to export through), diffusion_llama/dream (bidirectional decoding).

Split lm_eval into its own `lm_eval` testing group (it previously shared `generate`)
so enabling generate doesn't pull the lm_eval tests — broken on transformers v5 —
into normal CI. The lm_eval group is unlisted per-config, defaulting to extra-slow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
transformers v4 LlamaConfig carries an `is_llama_config` marker (dropped in v5)
that Fast-LLM doesn't consume and that a bare PretrainedConfig omits, so the
import-boundary coverage check rejected it — Fast-LLM could not import a real
transformers-4.x Llama checkpoint. Add it to the static metadata allowlist.

Verified on transformers 4.57.5: all supported HF converters (llama, qwen2,
mistral, mixtral, mtp_llama) now report no unconsumed config keys.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
starcoder_2 has no checkpoint_format. The export-based generate tests always
skip, so `generate` becomes `not_implemented` (was the misleading `broken`).
The convert group still runs the native Distributed<->FastLLM round-trip, but
that machinery is exercised by other models, so it drops to `unimportant`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jlamypoirier jlamypoirier force-pushed the jlp_modernize-hf-generate-wrapper branch from 47134f6 to 1e8fa93 Compare June 10, 2026 21:31
jlamypoirier and others added 2 commits June 10, 2026 18:33
…le mtp_llama

`inner_forward` popped only the main head's logits out of the hidden-states
namespace, so on multi-token-prediction models the extra heads' logits leaked
into the returned `hidden_states`. Pop every head's logits (discarding the
prediction heads' when not stacking them). The forward hidden-states count is
then `num_blocks + prediction_heads` for any head configuration, generalizing
the previous single-head `num_blocks + 1`. With both fixed, mtp_llama's generate
tests pass, so its `generate` group moves to `normal` (verified on 4.57 GPU).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jlamypoirier jlamypoirier merged commit b7833df into main Jun 11, 2026
4 checks passed
@jlamypoirier jlamypoirier deleted the jlp_modernize-hf-generate-wrapper branch June 11, 2026 22:13