Error when quantizing my fine-tuned 405B model using AutoAWQ

#13
by Atomheart-Father - opened

I copied the code from the model card to quantize my own model, but I hit this exception:

(screenshot of the exception)

The code I used:

(screenshot of the code)

I have enough CPU RAM and 8 A800 GPUs.

Update transformers to 4.43.x:

 pip install -U transformers

It's already 4.43.x:

(screenshot showing the installed transformers version)

Add device_map="cuda" or device_map="auto":

model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False, device_map="cuda"
)

According to the model card, I do not need to load the model onto the GPU. Besides, the original BF16 405B model requires 800+ GB of VRAM, yet the instructions below say I can quantize this model with only 80 GB of VRAM:

(screenshot of the model-card instructions)

I am not trying to use this 4-bit model; I am using the AWQ tool to quantize my own BF16 405B model to 4-bit.

Hugging Quants org (edited Aug 3)

Hi here @Atomheart-Father, thanks for opening this issue! May I ask which AutoAWQ version you have installed? I believe there was a recent release, https://github.com/casper-hansen/AutoAWQ/releases/tag/v0.2.6, adding batched quantization (https://github.com/casper-hansen/AutoAWQ/pull/516), which may have something to do with the device-placement issue you mentioned above.

So AFAIK there are two solutions, if that's the case and you have AutoAWQ 0.2.6 installed: either downgrade to AutoAWQ 0.2.5, or run the quantization script with CUDA_VISIBLE_DEVICES=0 (assuming 0 is the index of the GPU you want to use to quantize the model). Since the issue is apparently between the cpu and cuda:0 devices, the best option may be to downgrade to 0.2.5 in the meantime. If that solves the issue, I'd recommend opening an issue with the detailed information at https://github.com/casper-hansen/AutoAWQ/issues, as it may affect other users too.
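
For reference, those two options would look roughly like this (quantize.py is just a placeholder name for whatever script you are running):

 # option 1: downgrade AutoAWQ
 pip install autoawq==0.2.5

 # option 2: make only one GPU visible to the quantization script
 CUDA_VISIBLE_DEVICES=0 python quantize.py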

I am using AutoAWQ 0.2.5 and still get this exception. CUDA_VISIBLE_DEVICES=0 does not solve the problem...

I've met the same issue as @Atomheart-Father; I also tried different device_map settings, but none of them solved the issue.

Hi @alvarobartt, I have the exact same issue as the users above. The code provided in the model card works for neither autoawq==0.2.5 nor the latest 0.2.6.
I have access to an 8xA100 80G machine with plenty of CPU RAM. So my question is: how did you do it? I mean, what machine did you use and what exact package versions?
Trying to reverse engineer what changes AutoAWQ might have made is a really long process. Thank you.

Btw, I have tried playing with max_memory as in the code below, but that still fails during quantization with an OOM (just mentioning it to save other people's time).

import fire
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


def main(model_path, quant_path):
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    tokenizer.save_pretrained(quant_path)

    # Load model sharded across the 8 GPUs, with the remainder offloaded to CPU RAM
    model = AutoAWQForCausalLM.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        # safetensors=True,
        device_map="auto",
        max_memory={0: "20GiB", 1: "20GiB", 2: "20GiB", 3: "20GiB", 4: "20GiB", 5: "20GiB", 6: "20GiB", 7: "20GiB", "cpu": "900GiB"},
        torch_dtype=torch.float16,
        # offload_folder="offload",
    )

    # Quantize
    model.quantize(tokenizer, quant_config=quant_config)

    # Save quantized model
    model.save_quantized(quant_path, safetensors=False)


if __name__ == "__main__":
    fire.Fire(main)
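
For completeness, I invoke it via python-fire roughly like this (script name and paths are placeholders):

 python quantize_awq.py --model_path /path/to/my-bf16-405b --quant_path /path/to/output-awq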

(I don't want to give anyone high hopes, because it's still running and it might crash at any moment, as AWQ often does) but this line here seems to avoid the above issue: https://github.com/casper-hansen/AutoAWQ/compare/main...davedgd:AutoAWQ:patch-1#diff-5ea134b0db33752ee601a18b73d2e41aa050f99961dfcf2be285580c44bb4eed
cc: @Atomheart-Father @dong-liuliu

Hugging Quants org (edited 5 days ago)

Hi here @Atomheart-Father, @dong-liuliu and @yannisp, the machine I used had 8 x H100 80GiB and ~2TB of CPU RAM (out of which we used just a single H100 80GiB and ~1TB of CPU RAM). ~1TB of CPU RAM should be enough to load the model on CPU, and then all that has to fit in GPU VRAM are the hidden layers (126 in this case), which AutoAWQ processes sequentially. Also, it would be great to know where the OOM is coming from, i.e. is it an OOM on the CPU or on the GPU?
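
In case it helps, the overall flow boils down to roughly the following sketch (the paths are placeholders, and this is just an outline of the calls discussed in this thread, not the exact model-card script):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/path/to/my-bf16-405b"  # placeholder
quant_path = "/path/to/output-awq"    # placeholder
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# No device_map here: the full BF16 model is loaded into CPU RAM,
# and AutoAWQ moves one decoder layer at a time onto the single visible GPU.
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)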

Thank you, this is very helpful. Just to confirm: did you use version 0.2.5 or maybe an older one, like 0.2.4 (I mean for AutoAWQ)?
The OOM happens not when I try your code, but with the code I shared, and it's during the quantization process. It comes from the GPU and it's a bit random (meaning it can happen at 22/126 or 40/126 of the quantization process).

Hugging Quants org

Oh, you're right, the version is not pinned. I used AutoAWQ v0.2.5 (see release notes). For transformers and accelerate I guess it's not that relevant here, but for context I used transformers 4.43.0 (see release notes) and accelerate 0.32.0 (see release notes).

Additionally, I used CUDA 12.1 and PyTorch 2.2.1.
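
So a matching environment would be pinned roughly like this (the exact torch wheel may differ depending on your CUDA 12.1 setup):

 pip install autoawq==0.2.5 transformers==4.43.0 accelerate==0.32.0 torch==2.2.1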

Thank you, this is helpful. I will post how it all goes!
