
Introduce outlines.models.transformers_vision #1052

Merged

1 commit merged into dottxt-ai:main on Jul 19, 2024

Conversation

@lapp0 (Collaborator) commented on Jul 19, 2024:

Rendered Docs: https://github.com/lapp0/outlines/blob/multimodal-models/docs/reference/models/transformers_vision.md

Changes

  • Introduce models.transformers_vision, which subclasses models.transformers and overrides its behavior so that it uses AutoProcessor instead of AutoTokenizer to handle both the text and the PIL.Image media.
  • Introduce VisionSequenceGeneratorAdapter, which handles and validates the media argument.
  • Update outlines.generate to dispatch TransformersVision models to VisionSequenceGeneratorAdapter (a usage sketch follows this list).
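
As a usage sketch only (the model ID and the explicit `model_class` are taken from the discussion later in this thread, not from this PR description):

```python
# Sketch only: build a vision model and a text generator that dispatches to
# VisionSequenceGeneratorAdapter. Passing model_class explicitly follows the
# workaround discussed below; the default may differ.
import transformers
import outlines
from outlines.models.transformers_vision import transformers_vision

model = transformers_vision(
    "llava-hf/llava-interleave-qwen-0.5b-hf",
    model_class=transformers.LlavaForConditionalGeneration,
)

describe = outlines.generate.text(model)
# `media` is a list of PIL.Image objects, validated by the adapter:
# description = describe("<image> detailed description:", [pil_image])
```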

Tests

  • tests/generate/test_api.py: Test prompt / media validation
  • tests/generate/test_generate.py:
    • Add a model_transformers_vision fixture. Tests pass locally but are disabled because a model small enough for CI isn't available.
    • Test all outlines.generate generators to ensure dispatching to this new sequence generator is handled correctly (see the sketch after this list).
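
A minimal sketch of the dispatch check described above (the function name and fixture wiring are illustrative, not the PR's actual test code):

```python
# Sketch: every outlines.generate constructor should hand a TransformersVision
# model to VisionSequenceGeneratorAdapter rather than the default adapter.
import outlines
from outlines.generate.api import VisionSequenceGeneratorAdapter

def check_text_dispatch(model_transformers_vision):
    generator = outlines.generate.text(model_transformers_vision)
    assert isinstance(generator, VisionSequenceGeneratorAdapter)
```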

@lapp0 force-pushed the multimodal-models branch 2 times, most recently from e033200 to 6adb73b on July 19, 2024 at 15:34.
@lapp0 requested a review from @rlouf on July 19, 2024 at 15:43.
@lapp0 (Collaborator, Author) commented on this part of the diff:

```python
assert re.fullmatch(pattern, res) is not None, res


@pytest.mark.parametrize("pattern", REGEX_PATTERNS)
@pytest.mark.skip(
```

There are a handful of open JSON validation issues. This is a good integration test case for addressing JSON generation failures in general, because it applies random models to structured JSON generation.
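
For context, the kind of structured-JSON integration check being referred to looks roughly like this (a sketch, not the PR's test; the schema and prompt are made up):

```python
# Sketch: generate JSON constrained by a Pydantic schema and check that the
# result parses into the schema class, which is what outlines.generate.json
# returns on success.
from pydantic import BaseModel
import outlines

class User(BaseModel):
    name: str
    age: int

def check_json_generation(model):
    generator = outlines.generate.json(model, User)
    user = generator("Return a random user as JSON:")
    assert isinstance(user, User)
```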

A reviewer (Member) commented on the docs snippet introduced by "Create convenience function to load a `PIL.Image` from URL", with a suggested change: open the code fence with ```` ```python ```` instead of a bare ```` ``` ```` so the example is syntax-highlighted.

A reviewer (Member) also commented on this docs snippet:

```python
from pydantic import BaseModel
from typing import List, Optional

def img_from_url(url)
```

Function missing.

@rlouf merged commit a7e3381 into dottxt-ai:main on Jul 19, 2024
7 checks passed
@parkervg (Contributor) commented:

Very excited about this update!

After pip installing main at 26e2934, I'm encountering an error trying out the example in the Transformers Vision Documentation.

Since I'm running on my GPU-less laptop, I'm using the tiny llava-hf/llava-interleave-qwen-0.5b-hf model, which is also a LlavaNextForConditionalGeneration model. I was able to recreate the below error with bczhou/tiny-llava-v1-hf as well.

Here's the full code snippet:

```python
import outlines
from outlines.models.transformers_vision import transformers_vision

model = transformers_vision(
    'llava-hf/llava-interleave-qwen-0.5b-hf'
)
from PIL import Image
from io import BytesIO
from urllib.request import urlopen

def img_from_url(url):
    img_byte_stream = BytesIO(urlopen(url).read())
    return Image.open(img_byte_stream).convert("RGB")

description_generator = outlines.generate.text(model)
description_generator(
    "<image> detailed description:",
    [img_from_url("https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg")]
)
```

And here is the full error I get, with transformers==4.43.3 and torch==2.2.2:

```
TypeError                                 Traceback (most recent call last)
Cell In[10], line 16
     13     return Image.open(img_byte_stream).convert("RGB")
     15 description_generator = outlines.generate.text(model)
---> 16 description_generator(
     17     "<image> detailed description:",
     18     [img_from_url("https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg")]
     19 )

File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/outlines/generate/api.py:555, in VisionSequenceGeneratorAdapter.__call__(self, prompts, media, max_tokens, stop_at, seed, **model_specific_params)
    549 prompts, media = self._validate_prompt_media_types(prompts, media)
    551 generation_params = self.prepare_generation_parameters(
    552     max_tokens, stop_at, seed
    553 )
--> 555 completions = self.model.generate(
    556     prompts,
    557     media,
    558     generation_params,
    559     self.logits_processor,
    560     self.sampling_params,
    561     **model_specific_params,
    562 )
    564 return self._format(completions)

File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/outlines/models/transformers_vision.py:56, in TransformersVision.generate(self, prompts, media, generation_parameters, logits_processor, sampling_parameters)
     46 inputs = self.processor(prompts, media, padding=True, return_tensors="pt").to(
     47     self.model.device
     48 )
     50 generation_kwargs = self._get_generation_kwargs(
     51     prompts,
     52     generation_parameters,
     53     logits_processor,
     54     sampling_parameters,
     55 )
---> 56 generated_ids = self._generate_output_seq(prompts, inputs, **generation_kwargs)
     58 # if single str input and single sample per input, convert to a 1D output
     59 if isinstance(prompts, str):
     60     # Should always be true until NotImplementedError above is fixed

File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/outlines/models/transformers.py:350, in Transformers._generate_output_seq(self, prompts, inputs, generation_config, **generation_kwargs)
    346 def _generate_output_seq(
    347     self, prompts, inputs, generation_config, **generation_kwargs
    348 ):
    349     input_ids = inputs["input_ids"]
--> 350     output_ids = self.model.generate(
    351         **inputs, generation_config=generation_config, **generation_kwargs
    352     )
    354     # encoder-decoder returns output_ids only, decoder-only returns full seq ids
    355     if self.model.config.is_encoder_decoder:

File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/transformers/generation/utils.py:1989, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1981     input_ids, model_kwargs = self._expand_inputs_for_generation(
   1982         input_ids=input_ids,
   1983         expand_size=generation_config.num_return_sequences,
   1984         is_encoder_decoder=self.config.is_encoder_decoder,
   1985         **model_kwargs,
   1986     )
   1988     # 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 1989     result = self._sample(
   1990         input_ids,
   1991         logits_processor=prepared_logits_processor,
   1992         logits_warper=prepared_logits_warper,
   1993         stopping_criteria=prepared_stopping_criteria,
   1994         generation_config=generation_config,
   1995         synced_gpus=synced_gpus,
   1996         streamer=streamer,
   1997         **model_kwargs,
   1998     )
   2000 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
   2001     # 11. prepare logits warper
   2002     prepared_logits_warper = (
   2003         self._get_logits_warper(generation_config, device=input_ids.device)
   2004         if generation_config.do_sample
   2005         else None
   2006     )

File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/transformers/generation/utils.py:2932, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, logits_warper, **model_kwargs)
   2929 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
   2931 # forward pass to get next token
-> 2932 outputs = self(**model_inputs, return_dict=True)
   2934 if synced_gpus and this_peer_finished:
   2935     continue  # don't waste resources running the code we don't need

File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/transformers/models/llava_next/modeling_llava_next.py:766, in LlavaNextForConditionalGeneration.forward(self, input_ids, pixel_values, image_sizes, attention_mask, position_ids, past_key_values, inputs_embeds, vision_feature_layer, vision_feature_select_strategy, labels, use_cache, output_attentions, output_hidden_states, return_dict)
    763 # 2. Merge text and images
    764 if pixel_values is not None and input_ids.shape[1] != 1 and pixel_values.size(0) > 0:
    765     # ! infer image_num_patches from image_sizes
--> 766     image_num_patches = [
    767         image_size_to_num_patches(
    768             image_size=imsize,
    769             grid_pinpoints=self.config.image_grid_pinpoints,
    770             patch_size=self.config.vision_config.image_size,
    771         )
    772         for imsize in image_sizes
    773     ]
    774     # figure out if pixel_values is concatenated or stacked
    775     if pixel_values.dim() == 5:
    776         # stacking when input is (batch_size, num_patches, num_channels, height, width)

TypeError: 'NoneType' object is not iterable
```

The good (?) news is that the error is at the transformers level; I haven't had time to debug it in much detail, though. As a sanity check, I verified that the pipeline example from https://huggingface.co/llava-hf/llava-interleave-qwen-0.5b-hf runs successfully and produces generated output.
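
For reference, that sanity check is roughly of the following shape (a sketch based on the model card; the prompt string and token limit are assumptions):

```python
# Plain transformers image-to-text pipeline check, independent of outlines.
from transformers import pipeline

pipe = pipeline("image-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
prompt = (
    "<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
outputs = pipe(
    "https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg",
    prompt=prompt,
    generate_kwargs={"max_new_tokens": 100},
)
print(outputs[0]["generated_text"])
```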

I'm not opening a stand-alone issue since this feature isn't part of an official outlines release yet, but I'm happy to create one if you'd prefer!

@lapp0 (Collaborator, Author) commented on Jul 30, 2024:

@parkervg I was able to reproduce your error. The issue is that it's trying to use LlavaNextForConditionalGeneration, which doesn't work for this model.

Could you please try setting the model and processor classes?

```python
>>> model = transformers_vision(
...     'llava-hf/llava-interleave-qwen-0.5b-hf',
...     processor_class=transformers.AutoProcessor,
...     model_class=transformers.LlavaForConditionalGeneration,
... )
>>> prompt
'<|im_start|>user\n<image>\nWhat are these?<|im_end|>\n<|im_start|>assistant\n'
>>> 
>>> description_generator = outlines.generate.text(model)
>>> description_generator(prompt, [raw_image])
'They are kittens.'
```

With your error in mind, we probably want to default to AutoProcessor and make model_class a required argument. Could you open an issue to do so?
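
A rough sketch of that proposal (the signature below is hypothetical; the actual change is in the follow-up PR referenced in the next comment):

```python
# Hypothetical revised factory: AutoProcessor by default, model_class required
# so the correct *ForConditionalGeneration class is always stated explicitly.
import transformers

def transformers_vision(
    model_name: str,
    model_class,  # e.g. transformers.LlavaForConditionalGeneration
    processor_class=transformers.AutoProcessor,
    **model_kwargs,
):
    ...
```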

@parkervg (Contributor) commented:

Thanks for the guidance, opened an issue and a corresponding PR here: #1077

Successfully merging this pull request may close these issues:

  • Vision LLMs and Outlines
  • Support for multi-modal models

3 participants