Make HF generate work on transformers v4.57 and v5 #536
Merged
Override `can_generate() -> True` on the HF inference wrapper base. v5's `PreTrainedModel.can_generate` walks `__bases__` by name and stops at any base whose name contains "PreTrainedModel"; the intermediate base hides the `GenerationMixin` inheritance, so the check returned False and `generate()` died with "no attribute 'generation_config'". The override is unconditional, correct on both transformers majors. Also fix stale assertions in the generate tests: `vocab_size` now lives under `base_model.embeddings`, and the returned hidden-states count is `num_blocks + 1`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
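The failure mode in this commit can be reproduced with a toy class hierarchy. The sketch below is a simplified model of a name-based base-class walk, not the transformers implementation. `ToyPreTrainedModel`, `_walk_bases_by_name`, and the wrapper and leaf classes are hypothetical stand-ins introduced only for illustration.

```python
class GenerationMixin:
    """Stand-in for the mixin that would provide generate()."""


class ToyPreTrainedModel:
    """Hypothetical stand-in; NOT transformers.PreTrainedModel."""

    @classmethod
    def can_generate(cls) -> bool:
        return _walk_bases_by_name(cls)


def _walk_bases_by_name(cls) -> bool:
    # A class that lists GenerationMixin as a direct base can generate.
    if any(base.__name__ == "GenerationMixin" for base in cls.__bases__):
        return True
    for base in cls.__bases__:
        # Assumption: the walk never looks past a base whose name
        # contains "PreTrainedModel".
        if "PreTrainedModel" in base.__name__:
            continue
        if _walk_bases_by_name(base):
            return True
    return False


class WrapperPreTrainedModel(ToyPreTrainedModel, GenerationMixin):
    """Intermediate base: its name hides the mixin from the walk."""


class BrokenCausalLM(WrapperPreTrainedModel):
    """Leaf class that inherits the mixin only through the intermediate base."""


class FixedWrapperPreTrainedModel(ToyPreTrainedModel, GenerationMixin):
    """Same intermediate base, with the unconditional override."""

    @classmethod
    def can_generate(cls) -> bool:
        return True


class FixedCausalLM(FixedWrapperPreTrainedModel):
    """Leaf class that picks up the override."""


# The mixin is inherited, yet the name-based walk does not see it.
print(issubclass(BrokenCausalLM, GenerationMixin), BrokenCausalLM.can_generate())
print(FixedCausalLM.can_generate())
```

Because the override does not inspect the class hierarchy at all, it gives the same answer whichever way the installed library performs the check.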
…harness

Three fixes needed to make the per-model generate tests run and match HF on transformers 4.57:

- `inner_forward` absorbs `cache_position` (and other version-dependent generate plumbing) via `**kwargs`. v4.57's `generate` passes `cache_position` to forward; v5 filters it out. Ignoring it is correct on the `use_cache=False` path.
- The HF wrapper's `from_pretrained` applies the source HF config's `bos/eos/pad` token ids to `generation_config` (Fast-LLM's import drops them as non-architecture metadata), so `generate` stops at EOS like `AutoModelForCausalLM`. Exposed as `_apply_generation_token_ids` so manually-constructed wrappers can opt in.
- `test_export_for_generate` now passes a `DistributedTestingConfig` (the helper's current signature) instead of a positional list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
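The first two fixes can be pictured with a small standalone sketch. It is not the Fast-LLM code: `toy_inner_forward`, `apply_generation_token_ids`, and the `SimpleNamespace` configs are hypothetical stand-ins, and the token id values are made up for illustration.

```python
from types import SimpleNamespace

GENERATION_TOKEN_ID_FIELDS = ("bos_token_id", "eos_token_id", "pad_token_id")


def toy_inner_forward(input_ids, attention_mask=None, **kwargs):
    """Toy forward: version-dependent extras such as cache_position land in
    kwargs and are ignored, so both calling conventions are accepted."""
    del kwargs  # deliberately unused
    return {"num_tokens": len(input_ids)}


def apply_generation_token_ids(generation_config, source_config):
    """Copy the token ids that are set on source_config onto generation_config.

    Ids that are missing or None on the source leave the existing default alone.
    """
    for field in GENERATION_TOKEN_ID_FIELDS:
        value = getattr(source_config, field, None)
        if value is not None:
            setattr(generation_config, field, value)
    return generation_config


# Same result whether or not the caller passes cache_position.
with_extra = toy_inner_forward([5, 6, 7], cache_position=[0, 1, 2])
without_extra = toy_inner_forward([5, 6, 7])

# Illustrative ids only: the source sets bos/eos but not pad.
source = SimpleNamespace(bos_token_id=1, eos_token_id=2, pad_token_id=None)
generation_config = SimpleNamespace(bos_token_id=None, eos_token_id=None, pad_token_id=0)
apply_generation_token_ids(generation_config, source)
print(with_extra == without_extra, generation_config)
```

Skipping `None` values is one way to keep the defaults for a source that carries no token ids, which is the behaviour the PR description states for native Fast-LLM checkpoints.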
Move the `generate` model-testing group from `broken` to `normal` for the models where it now passes end-to-end (verified on transformers 4.57 GPU). The CUDA-bound generate tests gain `@requires_cuda` so they skip on the CPU-only CI runner instead of crashing; `test_export_for_generate` stays CPU-runnable as the dependency root. Models left `broken`, with reasons: qwen_2 (bf16/flash near-tie argmax flip), mtp_llama (forward hidden-states count not modeled for multi-head), starcoder_2 (no converter to export through), diffusion_llama/dream (bidirectional decoding). Split lm_eval into its own `lm_eval` testing group (it previously shared `generate`) so enabling generate doesn't pull the lm_eval tests — broken on transformers v5 — into normal CI. The lm_eval group is unlisted per-config, defaulting to extra-slow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
transformers v4 LlamaConfig carries an `is_llama_config` marker (dropped in v5) that Fast-LLM doesn't consume and that a bare PretrainedConfig omits, so the import-boundary coverage check rejected it — Fast-LLM could not import a real transformers-4.x Llama checkpoint. Add it to the static metadata allowlist. Verified on transformers 4.57.5: all supported HF converters (llama, qwen2, mistral, mixtral, mtp_llama) now report no unconsumed config keys. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
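A coverage check of this kind can be sketched as a set difference. This is only an illustration of the idea, not Fast-LLM's check: `unconsumed_keys`, `STATIC_METADATA_ALLOWLIST`, and the example config values are hypothetical.

```python
def unconsumed_keys(config: dict, consumed: set, allowlist: frozenset) -> set:
    """Return config keys that are neither consumed by the importer nor allowlisted."""
    return set(config) - set(consumed) - set(allowlist)


# Hypothetical allowlist holding the marker named in the commit message.
STATIC_METADATA_ALLOWLIST = frozenset({"is_llama_config"})

# Illustrative config: two architecture keys plus the v4-only marker.
config = {"hidden_size": 64, "num_hidden_layers": 2, "is_llama_config": True}
consumed = {"hidden_size", "num_hidden_layers"}

before = unconsumed_keys(config, consumed, frozenset())
after = unconsumed_keys(config, consumed, STATIC_METADATA_ALLOWLIST)
print(before, after)
```

With an empty allowlist the marker is reported as unconsumed; once it is allowlisted the check reports nothing.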
starcoder_2 has no checkpoint_format. The export-based generate tests always skip, so `generate` becomes `not_implemented` (was the misleading `broken`). The convert group still runs the native Distributed<->FastLLM round-trip, but that machinery is exercised by other models, so it drops to `unimportant`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le mtp_llama `inner_forward` popped only the main head's logits out of the hidden-states namespace, so on multi-token-prediction models the extra heads' logits leaked into the returned `hidden_states`. Pop every head's logits (discarding the prediction heads' when not stacking them). The forward hidden-states count is then `num_blocks + prediction_heads` for any head configuration, generalizing the previous single-head `num_blocks + 1`. With both fixed, mtp_llama's generate tests pass, so its `generate` group moves to `normal` (verified on 4.57 GPU). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
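The leak and its fix can be sketched with a plain dictionary standing in for the hidden-states namespace. The key names (`hidden_*`, `logits_*`), the `split_outputs` helper, and the string placeholders are hypothetical. The namespace is built with `num_blocks + prediction_heads` hidden-state entries only because that is the count this commit message states.

```python
def split_outputs(namespace: dict, prediction_heads: int, stack_heads: bool = False):
    """Pop every head's logits out of the namespace; what remains are hidden states."""
    remaining = dict(namespace)  # leave the caller's dict untouched
    head_logits = [remaining.pop(f"logits_{head}") for head in range(prediction_heads)]
    # Keep only the main head's logits unless the caller asks for all heads.
    logits = head_logits if stack_heads else head_logits[0]
    return logits, list(remaining.values())


num_blocks, prediction_heads = 4, 2

# Toy namespace: hidden states plus one logits entry per head.
namespace = {f"hidden_{i}": f"h{i}" for i in range(num_blocks + prediction_heads)}
namespace.update({f"logits_{head}": f"l{head}" for head in range(prediction_heads)})

# Previous behaviour: only the main head's logits are popped, so one leaks.
leaky = dict(namespace)
leaky.pop("logits_0")
print(len(leaky))  # one entry more than num_blocks + prediction_heads

logits, hidden_states = split_outputs(namespace, prediction_heads)
print(logits, len(hidden_states))
```

Popping by head index rather than by a single fixed key is what makes the count independent of the head configuration in this toy version.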
What

Makes `HuggingfaceGPTModelForCausalLM.generate()` run and match HF greedy output on both transformers v4.57 (the declared floor) and v5, and enables the `generate` model-testing group for the models where it now passes.

Changes

- `can_generate() -> True` on the wrapper base. transformers v5's `PreTrainedModel.can_generate` walks `__bases__` by name and stops at any base whose name contains `"PreTrainedModel"`; our intermediate base hid the `GenerationMixin` inheritance, so the check returned `False` and `generate()` died with `'... has no attribute generation_config'`. Unconditional, so correct on both majors.
- `inner_forward` absorbs `cache_position` (and other generate plumbing) via `**kwargs`. On v4.57 `generate` passes `cache_position` to `forward`; v5 filters it out. Ignoring it is correct on the `use_cache=False` path: positions are reconstructed from `attention_mask`, and full logits are computed with the last position selected downstream.
- The wrapper honors the source HF config's generation token ids. Fast-LLM's import drops `bos/eos/pad` (generation metadata, not architecture), so `generate` never stopped at EOS. `from_pretrained` now applies them from the source HF config to `generation_config`, mirroring `AutoModelForCausalLM.from_pretrained`. Exposed as `_apply_generation_token_ids` so manually-constructed wrappers can opt in. Native Fast-LLM checkpoints (no token ids) keep the defaults.
- Prediction-head logits no longer leak into the returned hidden states. `inner_forward` popped only the main head's logits, so multi-token-prediction heads' logits leaked into `output_hidden_states`. All heads' logits are now popped (extra ones discarded unless stacked), so the hidden-states count is `num_blocks + prediction_heads` for any head configuration.
- Allowlist `is_llama_config` in the HF config coverage check. transformers v4 `LlamaConfig` carries an `is_llama_config` marker (dropped in v5) that Fast-LLM doesn't consume and a bare `PretrainedConfig` omits, so the import-boundary check rejected it and Fast-LLM could not import a real transformers-4.x Llama checkpoint. Verified on 4.57.5 that all supported converters (llama/qwen2/mistral/mixtral/mtp_llama) now report no unconsumed config keys.
- Generate test group enabled for `llama`, `mistral`, `mixtral`, `mtp_llama` (verified on transformers 4.57 GPU). The CUDA-bound generate tests gain `@requires_cuda` so they skip on the CPU-only CI runner; `test_export_for_generate` stays CPU-runnable. Also fixes stale test code: `vocab_size` moved under `embeddings`, hidden-states count generalized, and `test_export_for_generate` uses the current `DistributedTestingConfig` signature.
- lm_eval split into its own `lm_eval` testing group (it shared `generate`), so enabling generate doesn't pull the lm_eval tests, which are broken on transformers v5, into CI.
- starcoder_2 generate/convert demoted. It has no HF checkpoint format, so the export-based generate tests can't run (`generate` → `not_implemented`, was a misleading `broken`), and its native conversion round-trip is redundant with other models (`convert` → `unimportant`).

Left `broken`, with reasons

- `qwen_2`: matches in fp32 but a bf16/flash near-tie argmax flips within the compared horizon (numerical).
- `diffusion_llama`/`dream`: bidirectional diffusion decoding.

Out of scope (follow-up)

- `import lm_eval` (0.4.9) fails at class-body time (`AutoModelForVision2Seq` renamed to `AutoModelForImageTextToText`).

🤖 Generated with Claude Code