VLM: Qwen2_VL Example #1027

Merged
merged 6 commits into main from kylesayrs/qwen-tracable on Jan 20, 2025

Conversation

@kylesayrs kylesayrs commented Jan 2, 2025

Purpose

  • Support compressing Qwen2VLForConditionalGeneration with vision calibration data

Follow-ups

  • Qwen/Qwen2-VL-72B-Instruct has memory issues that are unrelated to the VLM architecture and which result from incorrect assumptions in calculate_offload_device_map. See [WIP] Fix hessian memory requirements #1084
    • When this lands, we'll replace the 2B example with the 72B example, since the accuracy loss from quantizing a 2B is pretty severe

Changes

  • Add traceable model definition src/llmcompressor/transformers/tracing/qwen2_vl.py
    • This mostly involves wrapping functions related to rope with image embeddings
    • The _prepare_4d_causal_attention_mask_with_cache_position function contains conditional logic that branches on whether attention_mask is None. This might be fixable with metadata in the future
  • Add example script examples/multimodal_vision/qwen2_vl_example.py
    • Qwen2_VL requires some custom data preprocessing and tokenization, which is implemented in the example script (see the sketch after this list)
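
A rough sketch of the kind of preprocessing the example performs is shown below. This is illustrative only, not the exact contents of qwen2_vl_example.py; the prompt text and the sample's "image" field are assumptions.

```python
# Minimal sketch: preparing one vision calibration sample for Qwen2-VL.
# The processor usage mirrors standard transformers Qwen2-VL handling;
# the dataset field names and prompt below are illustrative assumptions.
from transformers import AutoProcessor

model_id = "Qwen/Qwen2-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)

def preprocess(sample):
    # Build a chat-style prompt that interleaves an image with text
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "What does this image show?"},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    # Tokenize the text and preprocess the image into the pixel values and
    # image grid metadata expected by Qwen2VLForConditionalGeneration
    return processor(
        text=[text],
        images=[sample["image"]],
        padding=False,
        return_tensors="pt",
    )
```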

Testing

  • Ran examples/multimodal_vision/qwen2_vl_example.py to completion with the 2B model:
========== SAMPLE GENERATION ==============
system                                                                                                                  
You are a helpful assistant.
user                                                                                                                    
Please describe the animal in this image
                                                                                                                        
assistant     
The animal in the image is a white kitten. It has a fluffy coat and is resting on a white keyboard. The kitten appears to be comfortable and relaxed, possibly enjoying the warmth of the keyboard.
==========================================

Evaluation

Base

hf-multimodal (pretrained=Qwen/Qwen2-VL-2B-Instruct,dtype=bfloat16,add_bos_token=True,convert_img_format=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|     Tasks      |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|----------------|------:|------|-----:|------|---|----:|---|-----:|
|Computer Science|      0|none  |     0|acc   |↑  |  0.2|±  |0.0743|

Quantized

hf-multimodal (pretrained=/home/kyle/llm-compressor/Qwen2-VL-2B-Instruct-W4A16-G128,dtype=bfloat16,add_bos_token=True,convert_img_format=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|     Tasks      |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|----------------|------:|------|-----:|------|---|----:|---|-----:|
|Computer Science|      0|none  |     0|acc   |↑  |  0.1|±  |0.0557|

As shown above, the accuracy loss from quantizing the 2B model is fairly severe, which is why we'll replace the 2B example with the 72B example once #1084 lands.

@kylesayrs kylesayrs changed the title Kylesayrs/qwen tracable VLM: TracableQwen2VLForConditionalGeneration Jan 2, 2025
@kylesayrs kylesayrs marked this pull request as ready for review January 2, 2025 21:15
@kylesayrs kylesayrs changed the title VLM: TracableQwen2VLForConditionalGeneration VLM: TraceableQwen2VLForConditionalGeneration Jan 4, 2025
@kylesayrs kylesayrs self-assigned this Jan 4, 2025
Base automatically changed from kylesayrs/gptq-partition to main January 8, 2025 22:15
dsikka added a commit that referenced this pull request Jan 8, 2025
## Purpose ##
* Enable oneshot quantization of vision-language models

![VLM Banner](https://github.com/user-attachments/assets/0d748714-b524-44f4-b850-a721f35d5543)
[Llama_3.2-Vision Graphviz](https://github.com/user-attachments/assets/6b371ccc-f9f6-4bf2-b4cd-24ed75a3cad0)

## Related Issues ##
* Fixes #91
* Fixes #961
* Fixes #990

## Prerequisites ##
* neuralmagic/compressed-tensors#193
* #917
* #943
  * #955
    * #950
* #998
* #1014

## Changes ##
### VLM Support ###
* Add multimodal examples in `examples/multimodal_vision`
* Modify `custom_offload_device_map` to support models which are not `XForCausalLM`
* Add custom data collators for VLM models in `src/llmcompressor/transformers/utils/data_collator.py`

### GPTQModifier ###
* Implement hooks-based compression in `GPTQModifier`
  * This replaces layer-compressor, which made many assumptions about model architecture
  * This also enables finer-grained sequential compression such as [true_sequential](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.GPTQConfig.true_sequential)
  * Functions previously implemented in `gptq_wrapper.py` are now implemented in `gptq_quantize.py`
* Implement `offload_hessians` parameter in `GPTQModifier`
* Implement data-pipelines-based calibration in `GPTQModifier` (see the usage sketch after this list)
  * First, an attempt will be made to trace the model and run the `sequential` pipeline
  * If that fails, assumptions will be made about the model architecture and an attempt will be made to run the `layer_sequential` pipeline
    * This ensures backwards compatibility with any previously supported models
  * If that fails, then the basic pipeline will be used, which is guaranteed to run but may require using `offload_hessians`
* Change hessian instability from a `ValueError` to a `_LinAlgError` so it can be ignored by the gptq pipeline fallback mechanism
* Add support for conv2d as indicated by [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ/blob/6689349625de973b9ee3016c28c11f32acf7f02c/auto_gptq/quantization/gptq.py#L45-L54)
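
For context, a minimal oneshot invocation of `GPTQModifier` might look like the sketch below. The import paths and arguments reflect typical llm-compressor usage around this time; the model id, dataset, and sample counts are placeholders, not values taken from this PR.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize weights to 4 bits with 16-bit activations; keep the LM head dense.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    offload_hessians=False,  # set True to trade runtime for lower GPU memory
)

# Calibration internally attempts the sequential pipeline first, then the
# layer_sequential pipeline, then the basic pipeline, as described above.
oneshot(
    model="Qwen/Qwen2-VL-2B-Instruct",  # placeholder model id
    dataset="ultrachat_200k",           # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```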

### Data Pipelines ###
* Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
* Basic Pipeline
  * Performs standard forward passes through the model with the provided dataloader
  * Used as a fallback, as well as for basic calibration passes in the future
* Layer Sequential Pipeline
  * Refactor of `LayerCompressor` as a straightforward data pipeline
  * Uses `IntermediatesCache` to handle activation offloading
* Sequential Pipeline
  * Uses graph tracing implemented by `torch.fx` to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
  * Implements a BFS algorithm to assign nodes to partitions
    * An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (`node.op == get_attr`)
  * Each partition (`Subgraph`) is compiled as an executable python function with the proper inputs and outputs
  * Uses `IntermediatesCache` to handle activation offloading
* Implement `IntermediatesCache`, which automagically handles the offloading and onloading of activations from batches (see the sketch after this list)
  * This class is capable of offloading many non-standard activation types, such as `Tuple`s and dataclasses such as `BaseModelOutputWithPast`
  * For convenience, the class also handles masking padding
  * The class is tested in `tests/llmcompressor/pipelines/test_cache.py`
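
To illustrate the offloading pattern (this is not the actual `IntermediatesCache` API, just a minimal sketch of the idea, assuming activations are moved to CPU between subgraph executions):

```python
import torch
from dataclasses import fields, is_dataclass

def offload(value):
    """Recursively move tensors to CPU, including tensors nested inside
    tuples and dataclasses such as BaseModelOutputWithPast."""
    if isinstance(value, torch.Tensor):
        return value.to("cpu")
    if isinstance(value, tuple):
        return tuple(offload(v) for v in value)
    if is_dataclass(value):
        for f in fields(value):
            setattr(value, f.name, offload(getattr(value, f.name)))
        return value
    return value

def onload(value, device):
    """Inverse of offload: move tensors back onto the execution device."""
    if isinstance(value, torch.Tensor):
        return value.to(device)
    if isinstance(value, tuple):
        return tuple(onload(v, device) for v in value)
    if is_dataclass(value):
        for f in fields(value):
            setattr(value, f.name, onload(getattr(value, f.name), device))
        return value
    return value

# Between subgraph executions, each batch's activations are offloaded to CPU
# and onloaded only when the next subgraph needs them.
```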

### Tracing ###
* In order to support sequential quantization of the large variety of different multimodal model architectures, some model definitions have to be altered to support tracing (see the sketch after this list)
* If the calibration dataset is text only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make models traceable
* For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimal?) increased memory usage, which leaves the door open for future support for quantizing modules in the vision tower
* Add traceable model definitions for llava, mistral, mllama, and glm
* All copyright licenses allow for alteration and redistribution; the line `# vllm-project: no copyright` was added in the same style as [text_generation.py](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/transformers/finetune/text_generation.py#L18)
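
The alteration pattern usually amounts to hiding untraceable control flow from the tracer. A minimal sketch using `torch.fx.wrap` is shown below; the helper function and its signature are hypothetical and not taken from any specific model definition.

```python
import torch
import torch.fx

# Hypothetical helper with data-dependent control flow (branching on whether
# the mask is None) that torch.fx cannot trace symbolically.
def build_causal_mask(attention_mask, target_length):
    if attention_mask is not None:
        return attention_mask[:, :target_length]
    return torch.ones(1, target_length)

# Registering the function as a leaf makes torch.fx record a single
# call_function node instead of tracing into the untraceable branch.
torch.fx.wrap("build_causal_mask")
```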

## Future Work / Follow-ups ##
* #1027
* #1032
* #1039
* #1030
* Create better data collators capable of handling larger batch sizes in order to support VLM fine tuning
* Better support prompt masking for multimodal processors in order to support VLM fine tuning

## Winogrande Evaluations ##

Model | Dataset | Scheme | Runtime | Winogrande
-- | -- | -- | -- | --
Llama-3-8B | ultrachat | W4A16 | 43m, 2xA4000 | 0.7545
Llama-3-70B | ultrachat | W4A16 | 303m, 1xH100 | 0.8216
Mixtral-8x7B | ultrachat | W4A16 | 317m, 1xA100 | 0.8200
openbmb/MiniCPM3-4B | ultrachat | W4A16 | 63m, 1xA100 | 0.6701
Qwen2-VL-2B-Instruct | ultrachat | W8A8 | 12m, 2xA4000 | 0.6188
Qwen2-VL-2B-Instruct | flickr | W8A8 | 24m, 2xA4000 | 0.6093
Llama-3.2-11B-Vision-Instruct | flickr | W8A8 | 75m, 1xA100 | 0.7837
Pixtral-12B-2409 | flickr | W8A8 | 52m, 1xA100 | 0.7924
llava-1.5-7b-hf | flickr | W8A8 | 15m, 1xH100 | 0.7214
Phi-3-vision-128k-instruct | flickr | W4A16 | 51m, 1xA100 | 0.7151

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32`

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1`

## MMMU Evaluations ##
Credit to @shubhra 

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-11B-Vision | N/A | Dense | 0.4144
Llama-3.2-11B-Vision | N/A | FP8-dynamic | 0.4300
Llama-3.2-11B-Vision | flickr | W4A16 | 0.4377
Llama-3.2-11B-Vision | flickr | W4A16-group | 0.4211

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-90B-Vision | N/A | Dense | 0.5388
Llama-3.2-90B-Vision | N/A | FP8-dynamic | 0.5278
Llama-3.2-90B-Vision | flickr | W4A16 | 0.5111
Llama-3.2-90B-Vision | flickr | W4A16-group | 0.5477

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Pixtral-12B-2409 | N/A | Dense | 0.5022
Pixtral-12B-2409 | N/A | FP8-dynamic | 0.5322
Pixtral-12B-2409 | flickr | W4A16 | 0.4500
Pixtral-12B-2409 | flickr | W4A16-group | 0.4689

## Testing ##
* [Nightly](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/12640439996)

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
@kylesayrs kylesayrs marked this pull request as draft January 14, 2025 05:17
@kylesayrs kylesayrs removed their assignment Jan 14, 2025
@kylesayrs kylesayrs force-pushed the kylesayrs/qwen-tracable branch from 9f599e0 to ea8f047 Compare January 14, 2025 17:38
@kylesayrs kylesayrs marked this pull request as ready for review January 14, 2025 17:40
@kylesayrs kylesayrs self-assigned this Jan 14, 2025
@kylesayrs kylesayrs marked this pull request as draft January 19, 2025 07:36
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs marked this pull request as ready for review January 20, 2025 16:12
@kylesayrs kylesayrs added the ready When a PR is ready for review label Jan 20, 2025
@kylesayrs kylesayrs changed the title VLM: TraceableQwen2VLForConditionalGeneration VLM: Qwen2VL Example Jan 20, 2025
@kylesayrs kylesayrs changed the title VLM: Qwen2VL Example VLM: Qwen2_VL Example Jan 20, 2025
@mgoin (Member) left a comment:

LGTM

@mgoin mgoin merged commit 4b805fe into main Jan 20, 2025
7 of 8 checks passed
@mgoin mgoin deleted the kylesayrs/qwen-tracable branch January 20, 2025 20:06