Make HF `generate` work on transformers v4.57 and v5 by jlamypoirier · Pull Request #536 · ServiceNow/Fast-LLM

jlamypoirier · 2026-06-08T19:58:09Z

What

Makes `HuggingfaceGPTModelForCausalLM.generate()` run and match HF greedy output on both transformers v4.57 (the declared floor) and v5, and enables the `generate` model-testing group for the models where it now passes.

Changes

`can_generate()` now returns `True` on the wrapper base. transformers v5's `PreTrainedModel.can_generate` walks `__bases__` by name and stops at any base whose name contains "PreTrainedModel"; our intermediate base hid the `GenerationMixin` inheritance, so the check returned `False` and `generate()` died with '... has no attribute generation_config'. The override is unconditional, so it is correct on both transformers majors.
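The failure mode described above can be reproduced with a toy class hierarchy. This is a minimal sketch, not the transformers implementation: `GenerationMixin` and `PreTrainedModel` here are local stand-ins, the name-based check is a simplification of the behavior described in this PR, and the `Wrapper*` class names are hypothetical.

```python
class GenerationMixin:
    """Local stand-in for the real mixin (assumption: not the transformers class)."""


class PreTrainedModel:
    """Local stand-in with a simplified, name-based `can_generate` check."""

    @classmethod
    def can_generate(cls) -> bool:
        # Direct inheritance from the mixin is detected by class name.
        if any(base.__name__ == "GenerationMixin" for base in cls.__bases__):
            return True
        for base in cls.__bases__:
            # Bases whose name contains "PreTrainedModel" are not descended into.
            if "PreTrainedModel" in base.__name__:
                continue
            if hasattr(base, "can_generate") and base.can_generate():
                return True
        return False


class WrapperPreTrainedModel(PreTrainedModel, GenerationMixin):
    """Hypothetical intermediate base; its name contains 'PreTrainedModel'."""


class WrapperForCausalLM(WrapperPreTrainedModel):
    """Hypothetical concrete model; it inherits the mixin only indirectly."""


class FixedWrapperPreTrainedModel(PreTrainedModel, GenerationMixin):
    """Same intermediate base, with the unconditional override."""

    @classmethod
    def can_generate(cls) -> bool:
        return True


class FixedWrapperForCausalLM(FixedWrapperPreTrainedModel):
    """Hypothetical concrete model built on the fixed base."""
```

In this toy, `WrapperForCausalLM` is a genuine subclass of `GenerationMixin`, yet the name-based walk reports that it cannot generate, because the only direct base is skipped. The override sidesteps the walk entirely, which is why it does not depend on the library version.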

`inner_forward` absorbs `cache_position` (and other generate plumbing) via `**kwargs`. On v4.57, `generate` passes `cache_position` to `forward`; v5 filters it out. Ignoring it is correct on the `use_cache=False` path: positions are reconstructed from `attention_mask`, and full logits are computed with the last position selected downstream.
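Why dropping `cache_position` is harmless without a KV cache can be illustrated in plain Python. This is a sketch under assumptions: `positions_from_attention_mask` and `inner_forward_sketch` are hypothetical names, the mask is a list of 0/1 rows rather than a tensor, and the actual reconstruction in Fast-LLM may differ.

```python
from typing import Any


def positions_from_attention_mask(attention_mask: list[list[int]]) -> list[list[int]]:
    """Rebuild per-token positions from a 0/1 mask (left or right padding)."""
    positions = []
    for row in attention_mask:
        seen = 0  # number of real tokens seen so far in this row
        row_positions = []
        for is_real in row:
            # Padding gets a placeholder position of 0; real tokens count up from 0.
            row_positions.append(seen if is_real else 0)
            seen += is_real
        positions.append(row_positions)
    return positions


def inner_forward_sketch(
    input_ids: list[list[int]],
    attention_mask: list[list[int]],
    **kwargs: Any,
) -> list[list[int]]:
    """Hypothetical forward: version-dependent extras land in kwargs and are ignored."""
    del kwargs  # e.g. `cache_position`; not needed when the full sequence is recomputed
    return positions_from_attention_mask(attention_mask)
```

Calling `inner_forward_sketch` with or without a `cache_position` keyword gives the same positions, since the whole sequence is reprocessed at every step and the mask alone determines where each token sits.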

The wrapper honors the source HF config's generation token ids. Fast-LLM's import drops `bos`/`eos`/`pad` (generation metadata, not architecture), so `generate` never stopped at EOS. `from_pretrained` now applies them from the source HF config to `generation_config`, mirroring `AutoModelForCausalLM.from_pretrained`. This is exposed as `_apply_generation_token_ids` so manually-constructed wrappers can opt in. Native Fast-LLM checkpoints (no token ids) keep the defaults.
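The token-id propagation can be sketched as a small copy step. This is an illustration only: `apply_generation_token_ids_sketch` is a hypothetical stand-in whose signature is assumed, not the signature of `_apply_generation_token_ids`, and `SimpleNamespace` objects stand in for the HF config and generation config.

```python
from types import SimpleNamespace

# Standard HF attribute names for the special token ids.
TOKEN_ID_FIELDS = ("bos_token_id", "eos_token_id", "pad_token_id")


def apply_generation_token_ids_sketch(generation_config, source_config):
    """Copy token ids that the source config defines; leave the rest untouched."""
    for name in TOKEN_ID_FIELDS:
        value = getattr(source_config, name, None)
        if value is not None:
            setattr(generation_config, name, value)
    return generation_config


# Usage: a source config with bos/eos but no pad id.
source = SimpleNamespace(bos_token_id=1, eos_token_id=2, pad_token_id=None)
generation = SimpleNamespace(bos_token_id=None, eos_token_id=None, pad_token_id=0)
apply_generation_token_ids_sketch(generation, source)
```

Skipping `None` values is what lets a checkpoint with no token ids keep whatever defaults the generation config already had.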

Prediction-head logits no longer leak into the returned hidden states. `inner_forward` popped only the main head's logits, so multi-token-prediction heads' logits leaked into the hidden states returned under `output_hidden_states`. All heads' logits are now popped (extra ones discarded unless stacked), so the hidden-states count is `num_blocks + prediction_heads` for any head configuration.
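The counting argument can be made concrete with a toy output dictionary. Everything about the layout here is an assumption made only to show the arithmetic: the key names, the idea that each extra prediction head contributes one additional hidden state, and the helper names `fake_outputs` and `split_outputs_sketch` are all hypothetical.

```python
def fake_outputs(num_blocks: int, prediction_heads: int) -> dict[str, str]:
    """Build a toy output namespace: hidden states plus one logits entry per head."""
    outputs = {"embeddings": "h_emb"}
    outputs.update({f"block_{i}": f"h_{i}" for i in range(num_blocks)})
    for head in range(prediction_heads):
        if head > 0:
            # Assumed: each extra head adds one hidden state of its own.
            outputs[f"head_{head}.block"] = f"h_head_{head}"
        outputs[f"head_{head}.logits"] = f"logits_{head}"
    return outputs


def split_outputs_sketch(
    outputs: dict[str, str], prediction_heads: int, stack_heads: bool = False
) -> tuple[list[str], list[str]]:
    """Pop every head's logits so that only true hidden states remain."""
    hidden = dict(outputs)
    logits = [hidden.pop(f"head_{head}.logits") for head in range(prediction_heads)]
    # Extra heads' logits are discarded unless the caller asks for them stacked.
    kept = logits if stack_heads else logits[:1]
    return kept, list(hidden.values())
```

With two blocks and three heads, the toy leaves `1 + 2 + 2 = 5` hidden states after popping, which equals `num_blocks + prediction_heads`; with a single head it reduces to `num_blocks + 1`.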

Allowlist `is_llama_config` in the HF config coverage check. transformers v4 `LlamaConfig` carries an `is_llama_config` marker (dropped in v5) that Fast-LLM doesn't consume and a bare `PretrainedConfig` omits, so the import-boundary check rejected it, and Fast-LLM could not import a real transformers-4.x Llama checkpoint. Verified on 4.57.5 that all supported converters (llama/qwen2/mistral/mixtral/mtp_llama) now report no unconsumed config keys.
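A coverage check of this kind can be sketched as a set computation. This is a minimal illustration, assuming the check flags any source key that is neither consumed by a converter, nor present with the same value in a bare default config, nor allowlisted; `unconsumed_keys` and its parameters are hypothetical and the real check may be structured differently.

```python
def unconsumed_keys(
    source_config: dict[str, object],
    consumed: set[str],
    defaults: dict[str, object],
    allowlist: frozenset[str] = frozenset(),
) -> set[str]:
    """Return source keys that nothing accounts for (hypothetical helper)."""
    leftover = set()
    for key, value in source_config.items():
        if key in consumed or key in allowlist:
            continue
        # A key that a bare config carries with the same value is not informative.
        if key in defaults and defaults[key] == value:
            continue
        leftover.add(key)
    return leftover


# Usage: a toy v4-style config with a marker key that nothing consumes.
toy_config = {"hidden_size": 64, "is_llama_config": True, "return_dict": True}
toy_consumed = {"hidden_size"}
toy_defaults = {"return_dict": True}

before = unconsumed_keys(toy_config, toy_consumed, toy_defaults)
after = unconsumed_keys(
    toy_config, toy_consumed, toy_defaults, frozenset({"is_llama_config"})
)
```

Here `before` contains the marker key and `after` is empty, which mirrors the "no unconsumed config keys" outcome reported above.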

The `generate` test group is enabled for llama, mistral, mixtral, and mtp_llama (verified on transformers 4.57 on GPU). The CUDA-bound generate tests gain `@requires_cuda` so they skip on the CPU-only CI runner; `test_export_for_generate` stays CPU-runnable. Also fixes stale test code: `vocab_size` moved under `embeddings`, the hidden-states count is generalized, and `test_export_for_generate` uses the current `DistributedTestingConfig` signature.

`lm_eval` is split into its own `lm_eval` testing group (it previously shared `generate`), so enabling `generate` doesn't pull the lm_eval tests, which are broken on transformers v5, into CI.

starcoder_2 generate/convert demoted: it has no HF checkpoint format, so the export-based generate tests can't run (`generate` → `not_implemented`; it was a misleading `broken`), and its native conversion round-trip is redundant with other models (`convert` → `unimportant`).

Left `broken`, with reasons

qwen_2: matches in fp32, but in bf16/flash a near-tie argmax flips within the compared horizon (numerical).
diffusion_llama / dream: bidirectional diffusion decoding.

Out of scope (follow-up)

lm_eval on transformers v5: `import lm_eval` (0.4.9) fails at class-body time (`AutoModelForVision2Seq` renamed to `AutoModelForImageTextToText`).

🤖 Generated with Claude Code

Override `can_generate() -> True` on the HF inference wrapper base. v5's `PreTrainedModel.can_generate` walks `__bases__` by name and stops at any base whose name contains "PreTrainedModel"; the intermediate base hides the `GenerationMixin` inheritance, so the check returned False and `generate()` died with "no attribute 'generation_config'". The override is unconditional, correct on both transformers majors.

Also fix stale assertions in the generate tests: `vocab_size` now lives under `base_model.embeddings`, and the returned hidden-states count is `num_blocks + 1`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…harness

Three fixes needed to make the per-model generate tests run and match HF on transformers 4.57:

- `inner_forward` absorbs `cache_position` (and other version-dependent generate plumbing) via `**kwargs`. v4.57's `generate` passes `cache_position` to forward; v5 filters it out. Ignoring it is correct on the `use_cache=False` path.
- The HF wrapper's `from_pretrained` applies the source HF config's `bos/eos/pad` token ids to `generation_config` (Fast-LLM's import drops them as non-architecture metadata), so `generate` stops at EOS like `AutoModelForCausalLM`. Exposed as `_apply_generation_token_ids` so manually-constructed wrappers can opt in.
- `test_export_for_generate` now passes a `DistributedTestingConfig` (the helper's current signature) instead of a positional list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Move the `generate` model-testing group from `broken` to `normal` for the models where it now passes end-to-end (verified on transformers 4.57 GPU). The CUDA-bound generate tests gain `@requires_cuda` so they skip on the CPU-only CI runner instead of crashing; `test_export_for_generate` stays CPU-runnable as the dependency root.

Models left `broken`, with reasons: qwen_2 (bf16/flash near-tie argmax flip), mtp_llama (forward hidden-states count not modeled for multi-head), starcoder_2 (no converter to export through), diffusion_llama/dream (bidirectional decoding).

Split lm_eval into its own `lm_eval` testing group (it previously shared `generate`) so enabling generate doesn't pull the lm_eval tests, which are broken on transformers v5, into normal CI. The lm_eval group is unlisted per-config, defaulting to extra-slow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

transformers v4 LlamaConfig carries an `is_llama_config` marker (dropped in v5) that Fast-LLM doesn't consume and that a bare PretrainedConfig omits, so the import-boundary coverage check rejected it, and Fast-LLM could not import a real transformers-4.x Llama checkpoint. Add it to the static metadata allowlist.

Verified on transformers 4.57.5: all supported HF converters (llama, qwen2, mistral, mixtral, mtp_llama) now report no unconsumed config keys.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

starcoder_2 has no checkpoint_format. The export-based generate tests always skip, so `generate` becomes `not_implemented` (was the misleading `broken`). The convert group still runs the native Distributed<->FastLLM round-trip, but that machinery is exercised by other models, so it drops to `unimportant`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…le mtp_llama

`inner_forward` popped only the main head's logits out of the hidden-states namespace, so on multi-token-prediction models the extra heads' logits leaked into the returned `hidden_states`. Pop every head's logits (discarding the prediction heads' when not stacking them). The forward hidden-states count is then `num_blocks + prediction_heads` for any head configuration, generalizing the previous single-head `num_blocks + 1`.

With both fixed, mtp_llama's generate tests pass, so its `generate` group moves to `normal` (verified on 4.57 GPU).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jlamypoirier force-pushed the jlp_modernize-hf-generate-wrapper branch from 2dbf360 to 892ba44 Compare June 8, 2026 20:23

jlamypoirier force-pushed the jlp_modernize-hf-generate-wrapper branch from 892ba44 to f72d5d1 Compare June 9, 2026 22:01

jlamypoirier changed the title ~~Make generate and lm_eval run on transformers v5~~ Make HF generate work on transformers v4.57 and v5 Jun 9, 2026

jlamypoirier and others added 4 commits June 10, 2026 14:36

jlamypoirier force-pushed the jlp_modernize-hf-generate-wrapper branch from 47134f6 to 1e8fa93 Compare June 10, 2026 21:31

jlamypoirier and others added 2 commits June 10, 2026 18:33

Clarify qwen_2 broken-reason comment

85ffad7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jlamypoirier merged commit b7833df into main Jun 11, 2026
4 checks passed

jlamypoirier deleted the jlp_modernize-hf-generate-wrapper branch June 11, 2026 22:13

jlamypoirier merged 7 commits into main from jlp_modernize-hf-generate-wrapper
