
VLM: Model Tracing Guide #1030

Open · kylesayrs wants to merge 366 commits into main
Conversation

@kylesayrs (Collaborator) commented Jan 2, 2025

Purpose

This guide explains the concepts of tracing as they relate to LLM Compressor and how to modify your model to support recipes which require using the Sequential Pipeline.

Through reading this guide, you will learn

  1. Why tracing is required when compressing with recipes involving the Sequential Pipeline and modifiers such as GPTQModifier
  2. How to determine if your model is traceable for your dataset
  3. How to modify your model definition to be traceable

Prerequisites

Changes

  • Add a model tracing guide src/llmcompressor/transformers/tracing/README.md with pictures
  • Add a readme for the sequential pipeline which points to the Tracing Guide src/llmcompressor/pipelines/sequential/README.md
  • Add a debug script to help users debug their models for traceability src/llmcompressor/transformers/tracing/debug.py
    • Add the llmcompressor.attempt_trace entrypoint for ease of use
  • Swap the order of arguments in llava_example.py and pixtral_example.py to match the order of arguments on the modifier (see the sketch below)
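
For reference, here is a minimal sketch of how those arguments sit on the modifier; the quantization scheme below is an assumption for illustration, not copied from the example files:

```python3
from llmcompressor.modifiers.quantization import GPTQModifier

# Sketch of a GPTQ recipe whose sequential_targets/ignore mirror the debug
# command in the Testing section below; the scheme value is assumed.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    sequential_targets=["LlamaDecoderLayer"],
    ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)
```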

Testing

Use the llmcompressor.attempt_trace debug script

llmcompressor.attempt_trace \
    --model_id llava-hf/llava-1.5-7b-hf \
    --model_class TraceableLlavaForConditionalGeneration \
    --sequential_targets LlamaDecoderLayer \
    --ignore "re:.*lm_head" "re:vision_tower.*" "re:multi_modal_projector.*" \
    --multimodal_data
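
The entrypoint above is the supported way to check traceability. As a rough, generic illustration of what "attempting a trace" means, a manual check with transformers' torch.fx helper might look like the following sketch; the model ID and input names are arbitrary choices for a small text-only model and are not part of this PR:

```python3
from transformers import AutoModelForCausalLM
from transformers.utils.fx import symbolic_trace

# Load a small text-only model; symbolic tracing records the forward call graph
# and fails if it hits operations that cannot be traced symbolically.
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Raises an error if the forward pass contains untraceable control flow
traced = symbolic_trace(model, input_names=["input_ids", "attention_mask"])
print(traced.graph)
```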

Stretch

It might be nice if this tracing debug tool also printed the model graph to an SVG
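
Not part of this PR, but one possible way to do that with stock torch.fx utilities (assuming a traced `GraphModule` like the one in the sketch above, plus pydot/graphviz installed):

```python3
from torch.fx.passes.graph_drawer import FxGraphDrawer

# `traced` is a torch.fx.GraphModule, e.g. from the traceability sketch above
drawer = FxGraphDrawer(traced, "traced_model")
drawer.get_dot_graph().write_svg("traced_model.svg")
```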

dsikka pushed a commit that referenced this pull request Jan 10, 2025
## Purpose ##
* Allow VLM processors to be used to tokenize datasets with prompt keys

## Postrequisites ##
* #1030

## Changes ##
* Use `text` argument name for tokenizing the prompt column

## Testing ##
* w.r.t. tokenizers, using the `text` kwarg follows the precedent set by
[PretrainedTokenizerBase](https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L2790)
* w.r.t. processors, most processors use the text kwarg

Below are all the models I know to be compatible with this change; I'm
assuming that most other processors follow the same standard:
1. [llama](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L233)
2. [pixtral](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pixtral/processing_pixtral.py#L160)
3. [phi3_vision](https://huggingface.co/microsoft/Phi-3.5-vision-instruct/blob/main/processing_phi3_v.py#L321)
4. [mllama](https://github.com/huggingface/transformers/blob/main/src/transformers/models/mllama/processing_mllama.py#L232)
5. [qwen2_vl](https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/processing_qwen2_vl.py#L71)

Example of using a VLM processor to tokenize a dataset with a prompt key:
```python3
from transformers import AutoProcessor
from llmcompressor.transformers import DataTrainingArguments, TextGenerationDataset

models_to_test = [
  "meta-llama/Meta-Llama-3-8B-Instruct",
  "mistralai/Mixtral-8x7B-Instruct-v0.1",
  "Qwen/Qwen2-VL-2B-Instruct",  # fails without changes
  "mgoin/pixtral-12b",  # fails without changes
]

for model_id in models_to_test:
  processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
  
  data_args = DataTrainingArguments(
      dataset="ultrachat-200k",
      splits={"calibration": "test_sft[:1]"}
  )
  
  dataset = TextGenerationDataset.load_from_registry(
      data_args.dataset,
      data_args=data_args,
      split=data_args.splits["calibration"],
      processor=processor,
  )(add_labels=False)
```
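
As a quick, illustrative sanity check of the `text` kwarg point above (not part of the PR; model choices taken from the list), a processor can tokenize text-only input the same way a tokenizer does:

```python3
from transformers import AutoProcessor, AutoTokenizer

# Both accept the prompt under the `text` keyword
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

print(tokenizer(text="Sample prompt")["input_ids"])
print(processor(text="Sample prompt")["input_ids"])
```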

Signed-off-by: Kyle Sayers <[email protected]>
return model_cls


def parse_args():

Collaborator:

We have click in setup.py; might be worth using it for the CLI

@kylesayrs (Collaborator Author) replied Jan 13, 2025:

I don't really see a good reason to (see https://click.palletsprojects.com/en/stable/why/#why-not-argparse), but thanks for the suggestion

```python3
legacy_processing = (
    (input_ids == self.config.image_token_index).sum(1).max() < self.config.image_seq_length
) or (input_ids.shape[-1] == 1 and pixel_values is not None).item()
```

Collaborator:

I read the whole thing.

I like how much time and thought you put into making this doc.

Right now, the audience needs to read until the 3rd paragraph to know what the problem is and when to use tracing -- encoder-decoder models using the GPTQ and SparseGPT modifiers. If we move those to the intro, it will be clearer for the audience whether the doc applies to them or not.

Then a small paragraph introducing what sections 1, 2, and 3 are helpful for would be useful -- 1 describes why the previous methods cannot be used and why the sequential pipeline solves the problem, 2 is how to run using the CLI, and 3 is debugging/contribution.

This way I think the audience can have an easier time navigating to the appropriate section by reading less.

@kylesayrs (Collaborator Author) replied Jan 13, 2025:

> Right now, the audience needs to read until the 3rd paragraph to know what the problem is and when to use tracing

As for when to use tracing, that's described in the second sentence:

> Through reading this guide, you will learn
> 1. Why tracing is required when compressing with recipes involving the Sequential Pipeline and modifiers such as GPTQModifier

As for what the problem is, that's described in the first section:

> ## 1. Why is Tracing Required? ##

@kylesayrs (Collaborator Author) replied Jan 13, 2025:

> Right now, the audience needs to read until the 3rd paragraph to know what the problem is and when to use tracing -- encoder-decoder models using the GPTQ and SparseGPT modifiers

That's incorrect; tracing is used for all model architectures, not just encoder-decoder models. As described in the second paragraph, tracing is used when compressing with recipes involving the Sequential Pipeline and modifiers such as GPTQModifier.

@kylesayrs (Collaborator Author) replied Jan 13, 2025:

> Then a small paragraph introducing what sections 1, 2, and 3 are helpful for
> This way I think the audience can have an easier time navigating to the appropriate section by reading less.

I think the section titles + the list of things you will learn from reading each of the sections is enough context for a reader to go on. For example, if the reader doesn't care about the why, they can skip 1. If the reader doesn't care about what traceability is, they can skip 2. If the reader doesn't care about how to make a model traceable, they can skip 3.

@kylesayrs requested a review from horheynm January 13, 2025 19:25
kylesayrs added a commit that referenced this pull request Jan 15, 2025

mgoin previously approved these changes Jan 20, 2025

@mgoin (Member) left a comment:

Great work, we should consider adding a readthedocs build like vLLM's to render these out

@dsikka (Collaborator) left a comment:

Great job.

A couple of nits:

  1. I wouldn't refer to the SparseGPTModifier until we've actually started using data pipelines outside of the GPTQModifier
  2. A helpful comment on what to focus on when looking at the images would be nice

# Sequential Pipeline #
The sequential pipeline is a data pipeline, primarily used for compressing models with the
[GPTQModifier](/src/llmcompressor/modifiers/quantization/gptq/base.py) or the
[SparseGPTModifier](/src/llmcompressor/modifiers/obcq/base.py).

Collaborator:

Because we're not yet using the data pipeline in the SparseGPTModifier, I would not include it in the README just yet

independently at calibration time. For a visual example of a model call graph, see
[Llama_3.2-Vision.svg](/src/llmcompressor/transformers/tracing/assets/Llama_3.2-Vision.svg).

<p align="center">

Collaborator:

What am I supposed to be taking away from this image?

Collaborator Author:

This image depicts the model graph referenced in the above paragraph. The image is a concrete example of what a model graph looks like and helps illustrate what the nodes and edges are within the graph.

traced (we don't see the individual `MllamaVisionEncoder` layers, ect.). However, we can
no longer target the modules within the `MllamaVisionModel` such as the
`MllamaVisionEncoder` as sequential targets. If any modules within the
`MllamaVisionModel` are being compressed, their hessians be all be allocated at the same

Collaborator:

Grammar: "their hessians be all be allocated ..."

multimodal_data: bool,
sequential_targets: Optional[Union[List[str], str]] = None,
ignore: Union[List[str], str] = [],
):

Collaborator:

docstring

Labels: ready (When a PR is ready for review)
Projects: None yet
4 participants