Repeatkv transform#997

Open
quic-dhirajku wants to merge 5 commits into quic:main from quic-dhirajku:repeatkv_transform

Conversation

@quic-dhirajku
Contributor

No description provided.

…VLMs. Based on PR quic#625. Addressed most of the comments made on the previous PR.

The repeat check runs on only a subset of models during CI, primarily because the configs of these models differ.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
…ng with changes made for the new transforms.

TODO: Check for the ONNX directory path name being different.
Check if the list of classes for mapping covers all the models that we support.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
…oder Wrappers were added to string mapping list to enable dummy model export for CI.

Changes were made to prevent ReplicateKVTransform from being applied more than once when either the Encoder or the Decoder Wrapper has already applied it.
Modeling files were updated to access the config in EncoderWrapper as well.
Infra was added for CausalLM and VLM checks in the repeatKV CI tests.
The APIRunner instantiation in the CausalLM script was moved so that the updated input shapes can be built first. Similarly, the export call in the VLM script was commented out, since compile already invokes export with the updated changes.
TODO: Confirm the changes that were made to the DeepSeekV3 model for RepeatKV; they have currently been removed in favor of a generic approach.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
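
One way to picture the guard described in this commit message (a transform that must not run twice when both wrappers reach the same model) is the minimal sketch below. The marker attribute `_kv_heads_replicated`, the function name `replicate_kv_heads_once`, and the stand-in model objects are hypothetical and are not taken from the PR. The sketch only scales a head count in the config; it does not attempt to show how the actual transform rewrites weights.

```python
from types import SimpleNamespace

# Hypothetical marker attribute; the PR may track this differently.
_APPLIED_FLAG = "_kv_heads_replicated"


def replicate_kv_heads_once(model, repeat):
    """Scale config.num_key_value_heads by `repeat` at most once per model.

    Returns (model, transformed), where `transformed` tells the caller
    whether this particular call changed anything.
    """
    if repeat <= 1 or getattr(model, _APPLIED_FLAG, False):
        return model, False
    model.config.num_key_value_heads *= repeat
    setattr(model, _APPLIED_FLAG, True)
    return model, True


# Stand-in for a model shared by an encoder wrapper and a decoder wrapper.
model = SimpleNamespace(config=SimpleNamespace(num_key_value_heads=2))
model, first = replicate_kv_heads_once(model, repeat=4)   # applies the change
model, second = replicate_kv_heads_once(model, repeat=4)  # no-op: already marked
```

Storing the marker on the model itself, rather than on a wrapper, means the second wrapper sees the flag regardless of which wrapper ran first.
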
@quic-rishinr
Contributor

@ochougul @vbaddi please review the PR

@quic-rishinr quic-rishinr added the 1.22 Release 1.22 candidate label May 25, 2026
Made changes to allow generic, name-based transformation of head counts (num_attention_heads, n_heads, n_head, etc.).
Minor edits were made and utils were created for this task.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Comment thread QEfficient/base/modeling_qeff.py
Edited the changes as suggested by quic-mamta.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
@quic-dhirajku quic-dhirajku marked this pull request as ready for review May 27, 2026 08:03
@vbaddi
Contributor

vbaddi commented May 29, 2026

nit: should we rename this to num_replicate_kv_heads? @quic-rishinr @ochougul @quic-dhirajku

architectures = getattr(model_config, "architectures", None) or []
is_deepseek_v3 = "DeepseekV3ForCausalLM" in architectures
if qaic_config:
    if is_deepseek_v3 and (qaic_config.get("blocking_mode", None) == "h"):

nit: for models with MLA and a single KV head (e.g. DeepSeekV3), we do not want to replicate. Is that what is being done here? It is not clear.

    self.model.config.text_config.use_cache = True
else:
    self.model.config.use_cache = True
# self.model, replicate_kv_transformed = ReplicateKVHeadTransform.apply(self.model, **kwargs)

nit: remove commented code.

if cls._is_mla_attention(attn):
    # Legacy MLA support: KV compression projection is organized as
    # [kv_heads, kv_lora_rank + qk_rope_head_dim, hidden_size].
    mla_orig_kv_heads = 1

nit: remove magic numbers; get them from the constants file.

# Generic config key aliases used across model families.
ATTENTION_HEAD_CONFIG_KEYS = ("num_attention_heads", "n_head", "n_heads", "num_heads")
KV_HEAD_CONFIG_KEYS = ("num_key_value_heads", "n_kv_heads", "num_kv_heads", "effective_n_kv_heads")
HIDDEN_SIZE_CONFIG_KEYS = ("hidden_size", "n_embd", "d_model")

does this cover all the models we support as of today?
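
For illustration, alias tuples like the ones in this snippet could be resolved against a model config roughly as follows. This is a sketch under assumptions: the helper `first_config_value` and the two stand-in configs are hypothetical, and the PR's own utilities may behave differently (for example, around nested `text_config` objects).

```python
from types import SimpleNamespace

# Alias tuples copied from the snippet under review.
ATTENTION_HEAD_CONFIG_KEYS = ("num_attention_heads", "n_head", "n_heads", "num_heads")
KV_HEAD_CONFIG_KEYS = ("num_key_value_heads", "n_kv_heads", "num_kv_heads", "effective_n_kv_heads")
HIDDEN_SIZE_CONFIG_KEYS = ("hidden_size", "n_embd", "d_model")


def first_config_value(config, keys, default=None):
    """Return (key, value) for the first alias set on `config`.

    Hypothetical helper: falls back to (None, default) when no alias is present.
    """
    for key in keys:
        value = getattr(config, key, None)
        if value is not None:
            return key, value
    return None, default


# Stand-in configs with illustrative values, not loaded from any checkpoint.
gpt2_like = SimpleNamespace(n_head=12, n_embd=768)
llama_like = SimpleNamespace(num_attention_heads=32, num_key_value_heads=8, hidden_size=2048)

heads_key, heads = first_config_value(gpt2_like, ATTENTION_HEAD_CONFIG_KEYS)
kv_key, kv_heads = first_config_value(gpt2_like, KV_HEAD_CONFIG_KEYS)
```

A config with no KV-head alias at all, like the GPT-2 style stand-in here, is the kind of case the coverage question above would need to account for.
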

"meta-llama/Llama-3.2-1B",
# "unsloth/gemma-2b",
# "unsloth/gemma-2-2b",
# "TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ",

Why is this commented out? Are there any known failures with AWQ, Gemma, or Mistral models?

@vbaddi
Contributor

vbaddi commented May 29, 2026

@quic-dhirajku please also add a detailed PR description covering the design, the changes made, and the test plan that was validated. Thanks.

qaic_config["head_block_size"] = qaic_config.get("head_block_size", num_devices)
num_kv_heads_repeat = qaic_config.get("num_kv_heads_repeat", 1)
architectures = getattr(model_config, "architectures", None) or []
is_deepseek_v3 = "DeepseekV3ForCausalLM" in architectures

@quic-mamta quic-mamta May 29, 2026

Please remove lines 459-463; they are not needed.
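
As background for the `num_kv_heads_repeat` value read from `qaic_config` in the snippet above, the sketch below shows what replicating KV heads by a repeat factor can look like on a key projection weight. It is a pure-Python toy under stated assumptions: the weight is laid out as rows grouped per KV head (the usual `[out_features, in_features]` linear-layer layout), copies of a head are kept adjacent, and the names `replicate_kv_rows` and `kv_head_for_query` are hypothetical rather than taken from the PR.

```python
def replicate_kv_rows(weight, num_kv_heads, head_dim, repeat):
    """Repeat each KV head's block of rows `repeat` times, keeping copies adjacent."""
    if len(weight) != num_kv_heads * head_dim:
        raise ValueError("weight must have num_kv_heads * head_dim rows")
    replicated = []
    for head in range(num_kv_heads):
        block = weight[head * head_dim:(head + 1) * head_dim]
        for _ in range(repeat):
            replicated.extend(list(row) for row in block)
    return replicated


def kv_head_for_query(query_head, num_attention_heads, num_kv_heads):
    """Grouped-query attention: consecutive query heads share one KV head."""
    return query_head // (num_attention_heads // num_kv_heads)


# Toy K projection: 2 KV heads, head_dim 2, hidden size 1.
k_proj = [[0.0], [1.0], [10.0], [11.0]]
k_proj_x2 = replicate_kv_rows(k_proj, num_kv_heads=2, head_dim=2, repeat=2)
```

Keeping the copies adjacent matters: with 4 query heads, each query head reads the same key rows before (2 KV heads) and after (4 KV heads) replication, so the attention output is unchanged while the KV heads can be split across more devices.
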


Labels

1.22 Release 1.22 candidate

4 participants