Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Quantized_model + float LoRA model scenario to model builder #1043

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

apsonawane
Copy link
Contributor

Add Quantized_model + float LoRA model scenario to model builder

@apsonawane apsonawane marked this pull request as ready for review November 7, 2024 19:02
@@ -437,7 +441,7 @@ def save_model(self, out_dir):
# Quantize ONNX model to desired precision
# TODO: Replace by quantizing the MatMuls as they are created
already_quantized_in_qdq_format = self.quant_type is not None and self.quant_attrs["use_qdq"] # Skip quantizing `MatMul` in `DequantizeLinear --> Transpose --> MatMul` path
if self.onnx_dtype == "int4" and not already_quantized_in_qdq_format:
if self.onnx_dtype == "int4" and not already_quantized_in_qdq_format and not self.matmul_attrs["use_lora"]:
Copy link

@jambayk jambayk Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MatMul4bits quantizer has an option to excludes nodes for quantization https://github.com/microsoft/onnxruntime/blob/e7987a6b0ba429c0bec248c4a471e1782da4be6c/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py#L1342

Maybe instead of a flag, you can keep a set of the lora matmul names and provide it to the quantizer? Otherwise, if the user provides float base + float adapters with int4 as precision, the output model will be fully float. But you might want to quantize the base model?
also for a quantized base model + float adapters, you might want to quantize the lm head like #940? Not sure what effect always quantizing the lm head has on accuracy though.

@@ -334,23 +369,51 @@ def __init__(self, quant_type, input_path, bits, group_size, q_size, kv_size, in
# model.layers.layer_id.mlp.dense_h_to_4h.bias
module.mlp.gate_proj.bias = tensor[: intermediate_size]
module.mlp.down_proj.bias = tensor[intermediate_size: ]
elif bool(re.match(r"^model.layers\.\d+\.self_attn.q_proj.lora_A\.weight$", name)):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these only cover llama type models. phi3 has qkv_proj and gate_up_proj instead of q,k,v,gate_proj,up_proj

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Kunal also mentioned about that. We can support for that as well

@@ -334,23 +369,51 @@ def __init__(self, quant_type, input_path, bits, group_size, q_size, kv_size, in
# model.layers.layer_id.mlp.dense_h_to_4h.bias
module.mlp.gate_proj.bias = tensor[: intermediate_size]
module.mlp.down_proj.bias = tensor[intermediate_size: ]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, looks like this is a bug? shouldn't it be up_proj?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants