VLM: Qwen2_VL Example #1027

kylesayrs · 2025-01-02T18:03:23Z

Purpose

Support compressing Qwen2VLForConditionalGeneration with vision calibration data

Follow-ups

Qwen/Qwen2-VL-72B-Instruct has memory issues that are unrelated to the VLM architecture and which result from incorrect assumptions in calculate_offload_device_map. See [WIP] Fix hessian memory requirements #1084
- When this lands, we'll replace the 2B example with the 72B example, since the accuracy loss from quantizing a 2B is pretty severe

Changes

Add traceable model definition src/llmcompressor/transformers/tracing/qwen2_vl.py
- This mostly involves wrapping functions related to rope with image embeddings
- The _prepare_4d_causal_attention_mask_with_cache_position function has conditional logic if attention_mask is not None. This might be fixable with metadata in the future
Add example script examples/multimodal_vision/qwen2_vl_example.py
- Qwen2_vl requires some custom data preprocessing and tokenization, which is implemented in the example script

Testing

Ran examples/multimodal_vision/qwen2_vl_example.py to completion with both 2B

========== SAMPLE GENERATION ==============
system                                                                                                                  
You are a helpful assistant.
user                                                                                                                    
Please describe the animal in this image
                                                                                                                        
assistant     
The animal in the image is a white kitten. It has a fluffy coat and is resting on a white keyboard. The kitten appears to be comfortable and relaxed, possibly enjoying the warmth of the keyboard.
==========================================

Evaluation

Base

hf-multimodal (pretrained=Qwen/Qwen2-VL-2B-Instruct,dtype=bfloat16,add_bos_token=True,convert_img_format=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|     Tasks      |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|----------------|------:|------|-----:|------|---|----:|---|-----:|
|Computer Science|      0|none  |     0|acc   |↑  |  0.2|±  |0.0743|

Quantized

hf-multimodal (pretrained=/home/kyle/llm-compressor/Qwen2-VL-2B-Instruct-W4A16-G128,dtype=bfloat16,add_bos_token=True,convert_img_format=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|     Tasks      |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|----------------|------:|------|-----:|------|---|----:|---|-----:|
|Computer Science|      0|none  |     0|acc   |↑  |  0.1|±  |0.0557|
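
The Stderr column in the two tables above can be reproduced with a short calculation. This is a minimal sketch, assuming the task has 30 examples and that the standard error is computed from the sample (n - 1) variance of 0/1 scores. Neither is stated in this PR, but both are consistent with the reported 0.0743 and 0.0557. The function name `accuracy_stderr` is made up for illustration, and treating the two runs as independent is a simplification.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores, using the sample (n - 1) variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# Assumption: 30 examples in the task (not stated in the PR).
N_EXAMPLES = 30

base_acc, quant_acc = 0.2, 0.1
base_se = accuracy_stderr(base_acc, N_EXAMPLES)
quant_se = accuracy_stderr(quant_acc, N_EXAMPLES)

# Simplification: combine the two errors as if the runs were independent.
diff = base_acc - quant_acc
diff_se = math.sqrt(base_se**2 + quant_se**2)

print(f"base stderr:  {base_se:.4f}")
print(f"quant stderr: {quant_se:.4f}")
print(f"difference:   {diff:.2f} +/- {diff_se:.4f}")
```

With so few examples, each question moves the accuracy by several points, which is worth keeping in mind when comparing 0.2 against 0.1.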

we'll replace the 2B example with the 72B example, since the accuracy loss from quantizing a 2B is pretty severe

@shubhra

## Purpose ##
* Enable oneshot quantization of vision-language models

![VLM Banner](https://github.com/user-attachments/assets/0d748714-b524-44f4-b850-a721f35d5543)
[Llama_3 2-Vision Graphviz](https://github.com/user-attachments/assets/6b371ccc-f9f6-4bf2-b4cd-24ed75a3cad0)

## Related Issues ##
* Fixes #91
* Fixes #961
* Fixes #990

## Prerequisites ##
* neuralmagic/compressed-tensors#193
* #917
* #943
* #955
* #950
* #998
* #1014

## Changes ##

### VLM Support ###
* Add multimodal examples in `examples/multimodal_vision`
* Modify `custom_offload_device_map` to support models which are not `XForCausalLM`
* Add custom data collators for VLM models in `src/llmcompressor/transformers/utils/data_collator.py`

### GPTQModifier ###
* Implement hooks-based compression in `GPTQModifier`
  * This replaces layer-compressor, which made many assumptions about model architecture
  * This also enables finer-grained sequential compression such as [true_sequential](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.GPTQConfig.true_sequential)
  * Functions previously implemented in `gptq_wrapper.py` are now implemented in `gptq_quantize.py`
* Implement `offload_hessians` parameter in `GPTQModifier`
* Implement data-pipelines-based calibration in `GPTQModifier`
  * First an attempt will be made to trace the model and run the `sequential` pipeline
  * If that fails, assumptions will be made about the model architecture and an attempt will be made to run the `layer_sequential` pipeline
    * This ensures backwards compatibility with any previously supported models
  * If that fails, then the basic pipeline will be used, which is guaranteed to run but may require using `offload_hessians`
* Change hessian instability from a `ValueError` to a `_LinAlgError` so it can be ignored by the gptq pipeline fallback mechanism
* Add support for conv2d as indicated by [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ/blob/6689349625de973b9ee3016c28c11f32acf7f02c/auto_gptq/quantization/gptq.py#L45-L54)

### Data Pipelines ###
* Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
* Basic Pipeline
  * Performs standard forward passes through the model with provided dataloader
  * Used as fallback, as well as in the future for basic calibration passes
* Layer Sequential Pipeline
  * Refactor of `LayerCompressor` as a straightforward data pipeline
  * Uses `IntermediatesCache` to handle activation offloading
* Sequential Pipeline
  * Utilizes graph tracing implemented by `torch.fx` to trace the graph in order to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
  * Implements BFS algorithm to assign nodes to partitions
    * An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (node.op == get_attr)
  * Each partition (`Subgraph`) is compiled as an executable python function with the proper inputs and outputs
  * Uses `IntermediatesCache` to handle activation offloading
* Implement `IntermediatesCache` which automagically handles the offloading and onloading of activations from batches
  * This class is capable of offloading many non-standard activation types such as `Tuple`s and dataclasses such as `BaseModelOutputWithPast`
  * For convenience, the class also handles masking padding
  * The class is tested in `tests/llmcompressor/pipelines/test_cache.py`

### Tracing ###
* In order to support sequential quantization of the large variety of different multimodal model architectures, some model definitions have to be altered to support tracing
  * If the calibration dataset is text only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make traceable
  * For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimal?) increased memory usage, which leaves the door open for future support for quantizing modules in the vision tower
* Add traceable model definitions for llava, mistral, mllama, and glm
* All copyright licenses allow for alteration and redistribution, the line `# vllm-project: no copyright` was added in similar style to [text_generation.py](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/transformers/finetune/text_generation.py#L18)

## Future Work/ Follow ups ##
* #1027
* #1032
* #1039
* #1030
* Create better data collators capable of handling larger batch sizes in order to support VLM fine tuning
* Better support prompt masking for multimodal processors in order to support VLM fine tuning

## Winogrande Evaluations ##

| Model | Dataset | Scheme | Runtime | Winogrande |
| -- | -- | -- | -- | -- |
| Llama-3-8B | ultrachat | W4A16 | 43m, 2xA4000 | 0.7545 |
| Llama-3-70B | ultrachat | W4A16 | 303m, 1xH100 | 0.8216 |
| Mixtral-8x7B | ultrachat | W4A16 | 317m, 1xA100 | 0.8200 |
| openbmb/MiniCPM3-4B | ultrachat | W4A16 | 63m, 1xA100 | 0.6701 |
| Qwen2-VL-2B-Instruct | ultrachat | W8A8 | 12m, 2xA4000 | 0.6188 |
| Qwen2-VL-2B-Instruct | flickr | W8A8 | 24m, 2xA4000 | 0.6093 |
| Llama-3.2-11B-Vision-Instruct | flickr | W8A8 | 75m, 1xA100 | 0.7837 |
| Pixtral-12B-2409 | flickr | W8A8 | 52m, 1xA100 | 0.7924 |
| llava-1.5-7b-hf | flickr | W8A8 | 15m, 1xH100 | 0.7214 |
| Phi-3-vision-128k-instruct | flickr | W4A16 | 51m, 1xA100 | 0.7151 |

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32`

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1`

## MMMU Evaluations ##
Credit to @shubhra

| Model | Dataset | Scheme | MMMU |
| -- | -- | -- | -- |
| Llama-3.2-11B-Vision | N/A | Dense | 0.4144 |
| Llama-3.2-11B-Vision | N/A | FP8-dynamic | 0.4300 |
| Llama-3.2-11B-Vision | flickr | W4A16 | 0.4377 |
| Llama-3.2-11B-Vision | flickr | W4A16-group | 0.4211 |

| Model | Dataset | Scheme | MMMU |
| -- | -- | -- | -- |
| Llama-3.2-90B-Vision | N/A | Dense | 0.5388 |
| Llama-3.2-90B-Vision | N/A | FP8-dynamic | 0.5278 |
| Llama-3.2-90B-Vision | flickr | W4A16 | 0.5111 |
| Llama-3.2-90B-Vision | flickr | W4A16-group | 0.5477 |

| Model | Dataset | Scheme | MMMU |
| -- | -- | -- | -- |
| Pixtral-12B-2409 | N/A | Dense | 0.5022 |
| Pixtral-12B-2409 | N/A | FP8-dynamic | 0.5322 |
| Pixtral-12B-2409 | flickr | W4A16 | 0.4500 |
| Pixtral-12B-2409 | flickr | W4A16-group | 0.4689 |

## Testing ##
* [Nightly](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/12640439996)

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
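
The calibration fallback order described under GPTQModifier above (the `sequential` pipeline, then `layer_sequential`, then the basic pipeline) can be sketched in plain Python. This is an illustrative sketch only, not the llm-compressor implementation: `PipelineError`, `run_with_fallback`, and the three stub functions are hypothetical names introduced here.

```python
from typing import Callable, List, Sequence, Tuple


class PipelineError(Exception):
    """Hypothetical error raised when a pipeline cannot handle the model."""


def run_with_fallback(pipelines: Sequence[Tuple[str, Callable[[], None]]]) -> str:
    """Run pipelines in order and return the name of the first one that succeeds."""
    failures: List[str] = []
    for name, pipeline in pipelines:
        try:
            pipeline()
            return name
        except PipelineError as exc:
            # Record why this pipeline failed, then fall through to the next one.
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all pipelines failed: " + "; ".join(failures))


# Stubs standing in for the real pipelines (hypothetical behavior).
def sequential() -> None:
    raise PipelineError("model could not be traced")


def layer_sequential() -> None:
    raise PipelineError("architecture assumptions did not hold")


def basic() -> None:
    pass  # plain forward passes; assumed to always run


chosen = run_with_fallback(
    [("sequential", sequential), ("layer_sequential", layer_sequential), ("basic", basic)]
)
print(chosen)
```

Only the hypothetical `PipelineError` is caught here, so an unexpected exception still surfaces instead of silently triggering a slower pipeline.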

Signed-off-by: Kyle Sayers <[email protected]>

mgoin

LGTM

kylesayrs changed the title ~~Kylesayrs/qwen tracable~~ VLM: TracableQwen2VLForConditionalGeneration Jan 2, 2025

kylesayrs marked this pull request as ready for review January 2, 2025 21:15

kylesayrs mentioned this pull request Jan 2, 2025

VLM Support via GPTQ Hooks and Data Pipelines #914

Merged

kylesayrs changed the title ~~VLM: TracableQwen2VLForConditionalGeneration~~ VLM: TraceableQwen2VLForConditionalGeneration Jan 4, 2025

kylesayrs self-assigned this Jan 4, 2025

Base automatically changed from kylesayrs/gptq-partition to main January 8, 2025 22:15

kylesayrs marked this pull request as draft January 14, 2025 05:17

kylesayrs removed their assignment Jan 14, 2025

Add TraceableQwen2VLForConditionalGeneration

ea8f047

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs force-pushed the kylesayrs/qwen-tracable branch from 9f599e0 to ea8f047 Compare January 14, 2025 17:38

Merge remote-tracking branch 'origin' into kylesayrs/qwen-tracable

4274faf

kylesayrs marked this pull request as ready for review January 14, 2025 17:40

kylesayrs added 2 commits January 14, 2025 18:57

use custom preprocessing and tokenization

282c1c4

Signed-off-by: Kyle Sayers <[email protected]>

use auto device map

f91bd6d

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs self-assigned this Jan 14, 2025

kylesayrs marked this pull request as draft January 19, 2025 07:36

simplify tracing changes

85ef39e

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs marked this pull request as ready for review January 20, 2025 16:12

Merge branch 'main' into kylesayrs/qwen-tracable

137c00a

kylesayrs added the ready When a PR is ready for review label Jan 20, 2025

kylesayrs changed the title ~~VLM: TraceableQwen2VLForConditionalGeneration~~ VLM: Qwen2VL Example Jan 20, 2025

kylesayrs changed the title ~~VLM: Qwen2VL Example~~ VLM: Qwen2_VL Example Jan 20, 2025

mgoin approved these changes Jan 20, 2025

mgoin merged commit 4b805fe into main Jan 20, 2025
7 of 8 checks passed

mgoin deleted the kylesayrs/qwen-tracable branch January 20, 2025 20:06
