Size mismatch loading Pixtral with LlavaForConditionalGeneration #33591

RonanKMcGovern · 2024-09-19T12:23:31Z

System Info

transformers version: 4.45.0.dev0
Platform: Linux-5.4.0-167-generic-x86_64-with-glibc2.35
Python version: 3.10.12
Huggingface_hub version: 0.25.0
Safetensors version: 0.4.5
Accelerate version: 0.34.2
Accelerate config: not found
PyTorch version (GPU?): 2.4.0+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:
Using GPU in script?: yes
GPU type: NVIDIA A40

Who can help?

@amyeroberts @ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

I'm running the exact code shown on this page:

from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "hf-internal-testing/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(model_id).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

IMG_URLS = [
    "https://picsum.photos/id/237/400/300",
    "https://picsum.photos/id/231/200/300",
    "https://picsum.photos/id/27/500/500",
    "https://picsum.photos/id/17/150/600",
]
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"

inputs = processor(images=IMG_URLS, text=PROMPT, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

EXPECTED_GENERATION = """
Describe the images.
Sure, let's break down each image description:

1. **Image 1:**
   - **Description:** A black dog with a glossy coat is sitting on a wooden floor. The dog has a focused expression and is looking directly at the camera.
   - **Details:** The wooden floor has a rustic appearance with visible wood grain patterns. The dog's eyes are a striking color, possibly brown or amber, which contrasts with its black fur.

2. **Image 2:**
   - **Description:** A scenic view of a mountainous landscape with a winding road cutting through it. The road is surrounded by lush green vegetation and leads to a distant valley.
   - **Details:** The mountains are rugged with steep slopes, and the sky is clear, indicating good weather. The winding road adds a sense of depth and perspective to the image.

3. **Image 3:**
   - **Description:** A beach scene with waves crashing against the shore. There are several people in the water and on the beach, enjoying the waves and the sunset.
   - **Details:** The waves are powerful, creating a dynamic and lively atmosphere. The sky is painted with hues of orange and pink from the setting sun, adding a warm glow to the scene.

4. **Image 4:**
   - **Description:** A garden path leading to a large tree with a bench underneath it. The path is bordered by well-maintained grass and flowers.
   - **Details:** The path is made of small stones or gravel, and the tree provides a shaded area with the bench invitingly placed beneath it. The surrounding area is lush and green, suggesting a well-kept garden.

Each image captures a different scene, from a close-up of a dog to expansive natural landscapes, showcasing various elements of nature and human interaction with it.
"""

Error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 5
      2 from PIL import Image
      4 model_id = "hf-internal-testing/pixtral-12b"
----> 5 model = LlavaForConditionalGeneration.from_pretrained(model_id,cache_dir='').to("cuda")
      6 processor = AutoProcessor.from_pretrained(model_id)
      8 IMG_URLS = [
      9     "https://picsum.photos/id/237/400/300",
     10     "https://picsum.photos/id/231/200/300",
     11     "https://picsum.photos/id/27/500/500",
     12     "https://picsum.photos/id/17/150/600",
     13 ]

File /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:3984, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   3974     if dtype_orig is not None:
   3975         torch.set_default_dtype(dtype_orig)
   3977     (
   3978         model,
   3979         missing_keys,
   3980         unexpected_keys,
   3981         mismatched_keys,
   3982         offload_index,
   3983         error_msgs,
-> 3984     ) = cls._load_pretrained_model(
   3985         model,
   3986         state_dict,
   3987         loaded_state_dict_keys,  # XXX: rename?
   3988         resolved_archive_file,
   3989         pretrained_model_name_or_path,
   3990         ignore_mismatched_sizes=ignore_mismatched_sizes,
   3991         sharded_metadata=sharded_metadata,
   3992         _fast_init=_fast_init,
   3993         low_cpu_mem_usage=low_cpu_mem_usage,
   3994         device_map=device_map,
   3995         offload_folder=offload_folder,
   3996         offload_state_dict=offload_state_dict,
   3997         dtype=torch_dtype,
   3998         hf_quantizer=hf_quantizer,
   3999         keep_in_fp32_modules=keep_in_fp32_modules,
   4000         gguf_path=gguf_path,
   4001     )
   4003 # make sure token embedding weights are still tied if needed
   4004 model.tie_weights()

File /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:4529, in PreTrainedModel._load_pretrained_model(***failed resolving arguments***)
   4525     if "size mismatch" in error_msg:
   4526         error_msg += (
   4527             "\n\tYou may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method."
   4528         )
-> 4529     raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
   4531 if len(unexpected_keys) > 0:
   4532     archs = [] if model.config.architectures is None else model.config.architectures

RuntimeError: Error(s) in loading state_dict for LlavaForConditionalGeneration:
	size mismatch for language_model.model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
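
One possible reading of the two shapes above, offered as a sketch and not a confirmed diagnosis: the rows of `q_proj.weight` equal `num_heads * head_dim`, so the two sizes could come from two different head dimensions. The values `NUM_HEADS = 32` and `HEAD_DIM = 128` below are assumptions that do not appear anywhere in this report; only the hidden size of 5120 is taken from the error message. The helper `q_proj_weight_shape` is hypothetical and is not part of transformers.

```python
def q_proj_weight_shape(hidden_size, num_heads, head_dim=None):
    """Return (out_features, in_features) for an attention query projection.

    When head_dim is not given, fall back to hidden_size // num_heads,
    which is the case where the projection ends up square.
    """
    if head_dim is None:
        head_dim = hidden_size // num_heads
    return (num_heads * head_dim, hidden_size)


HIDDEN_SIZE = 5120  # second dimension of both shapes in the error message
NUM_HEADS = 32      # assumption, not stated in this report
HEAD_DIM = 128      # assumption, not stated in this report

# Head dimension derived from the hidden size: matches "shape in current model".
derived = q_proj_weight_shape(HIDDEN_SIZE, NUM_HEADS)

# Head dimension set explicitly: matches the shape stored in the checkpoint.
explicit = q_proj_weight_shape(HIDDEN_SIZE, NUM_HEADS, HEAD_DIM)

print("derived head_dim :", derived)   # (5120, 5120)
print("explicit head_dim:", explicit)  # (4096, 5120)
```

If this reading holds, the checkpoint would have been saved with an explicit head dimension that the model built from the config does not use. Printing the text config's hidden size, head count, and head dimension would confirm or rule this out.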

Expected behavior

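I would expect the model to load normally. Something is off in the dimensions: the checkpoint's q_proj weight has shape [4096, 5120], while the instantiated model expects [5120, 5120]. Is there perhaps another model version on the Hugging Face Hub with the correct config? Many thanks.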

P.S. I had to uninstall flash-attn to get this far. I assume it's just not supported here, which would be worth adding to the docs.

RonanKMcGovern added the bug label Sep 19, 2024
