Size mismatch loading Pixtral with LlavaForConditionalGeneration #33591

Open
2 of 4 tasks
RonanKMcGovern opened this issue Sep 19, 2024 · 0 comments

@RonanKMcGovern

System Info

  • transformers version: 4.45.0.dev0
  • Platform: Linux-5.4.0-167-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.25.0
  • Safetensors version: 0.4.5
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?: yes
  • GPU type: NVIDIA A40

Who can help?

@amyeroberts @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm running the exact code shown on this page:

from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "hf-internal-testing/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(model_id).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

IMG_URLS = [
    "https://picsum.photos/id/237/400/300",
    "https://picsum.photos/id/231/200/300",
    "https://picsum.photos/id/27/500/500",
    "https://picsum.photos/id/17/150/600",
]
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"

inputs = processor(images=IMG_URLS, text=PROMPT, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

EXPECTED_GENERATION = """
Describe the images.
Sure, let's break down each image description:

1. **Image 1:**
   - **Description:** A black dog with a glossy coat is sitting on a wooden floor. The dog has a focused expression and is looking directly at the camera.
   - **Details:** The wooden floor has a rustic appearance with visible wood grain patterns. The dog's eyes are a striking color, possibly brown or amber, which contrasts with its black fur.

2. **Image 2:**
   - **Description:** A scenic view of a mountainous landscape with a winding road cutting through it. The road is surrounded by lush green vegetation and leads to a distant valley.
   - **Details:** The mountains are rugged with steep slopes, and the sky is clear, indicating good weather. The winding road adds a sense of depth and perspective to the image.

3. **Image 3:**
   - **Description:** A beach scene with waves crashing against the shore. There are several people in the water and on the beach, enjoying the waves and the sunset.
   - **Details:** The waves are powerful, creating a dynamic and lively atmosphere. The sky is painted with hues of orange and pink from the setting sun, adding a warm glow to the scene.

4. **Image 4:**
   - **Description:** A garden path leading to a large tree with a bench underneath it. The path is bordered by well-maintained grass and flowers.
   - **Details:** The path is made of small stones or gravel, and the tree provides a shaded area with the bench invitingly placed beneath it. The surrounding area is lush and green, suggesting a well-kept garden.

Each image captures a different scene, from a close-up of a dog to expansive natural landscapes, showcasing various elements of nature and human interaction with it.
"""

Error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 5
      2 from PIL import Image
      4 model_id = "hf-internal-testing/pixtral-12b"
----> 5 model = LlavaForConditionalGeneration.from_pretrained(model_id,cache_dir='').to("cuda")
      6 processor = AutoProcessor.from_pretrained(model_id)
      8 IMG_URLS = [
      9     "https://picsum.photos/id/237/400/300",
     10     "https://picsum.photos/id/231/200/300",
     11     "https://picsum.photos/id/27/500/500",
     12     "https://picsum.photos/id/17/150/600",
     13 ]

File /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:3984, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   3974     if dtype_orig is not None:
   3975         torch.set_default_dtype(dtype_orig)
   3977     (
   3978         model,
   3979         missing_keys,
   3980         unexpected_keys,
   3981         mismatched_keys,
   3982         offload_index,
   3983         error_msgs,
-> 3984     ) = cls._load_pretrained_model(
   3985         model,
   3986         state_dict,
   3987         loaded_state_dict_keys,  # XXX: rename?
   3988         resolved_archive_file,
   3989         pretrained_model_name_or_path,
   3990         ignore_mismatched_sizes=ignore_mismatched_sizes,
   3991         sharded_metadata=sharded_metadata,
   3992         _fast_init=_fast_init,
   3993         low_cpu_mem_usage=low_cpu_mem_usage,
   3994         device_map=device_map,
   3995         offload_folder=offload_folder,
   3996         offload_state_dict=offload_state_dict,
   3997         dtype=torch_dtype,
   3998         hf_quantizer=hf_quantizer,
   3999         keep_in_fp32_modules=keep_in_fp32_modules,
   4000         gguf_path=gguf_path,
   4001     )
   4003 # make sure token embedding weights are still tied if needed
   4004 model.tie_weights()

File /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:4529, in PreTrainedModel._load_pretrained_model(***failed resolving arguments***)
   4525     if "size mismatch" in error_msg:
   4526         error_msg += (
   4527             "\n\tYou may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method."
   4528         )
-> 4529     raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
   4531 if len(unexpected_keys) > 0:
   4532     archs = [] if model.config.architectures is None else model.config.architectures

RuntimeError: Error(s) in loading state_dict for LlavaForConditionalGeneration:
	size mismatch for language_model.model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
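One possible reading of the mismatch, which this issue does not confirm: the checkpoint was saved with an explicit per-head dimension, while the model class derives the head dimension as hidden_size // num_attention_heads. The sketch below only reproduces that arithmetic. The hidden size of 5120 comes from the error message; the 32 heads and head_dim of 128 are assumptions chosen because 32 * 128 = 4096, and `expected_q_proj_shape` is a hypothetical helper, not a transformers function.

```python
from types import SimpleNamespace


def expected_q_proj_shape(text_config):
    """Return the (out_features, in_features) shape of q_proj.weight for a config-like object."""
    hidden_size = text_config.hidden_size
    num_heads = text_config.num_attention_heads
    # Use an explicit head_dim if the config has one, otherwise derive it.
    head_dim = getattr(text_config, "head_dim", None) or hidden_size // num_heads
    return (num_heads * head_dim, hidden_size)


# Assumed values: only hidden_size=5120 is implied by the error message.
derived = SimpleNamespace(hidden_size=5120, num_attention_heads=32)
explicit = SimpleNamespace(hidden_size=5120, num_attention_heads=32, head_dim=128)

print(expected_q_proj_shape(derived))   # (5120, 5120), the shape in the current model
print(expected_q_proj_shape(explicit))  # (4096, 5120), the shape in the checkpoint
```

If this reading is right, the fix would be on the config or modeling side rather than `ignore_mismatched_sizes=True`, which would skip the mismatched weights and leave them randomly initialized.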

Expected behavior

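I would expect the model to load normally. Something is off in the dimensions: the checkpoint's q_proj weight has shape [4096, 5120], while the instantiated model expects [5120, 5120]. Is there perhaps another version of the model on the Hugging Face Hub with the correct config? Many thanks.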
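To check which shapes the checkpoint actually stores without loading the weights, the header of a safetensors shard can be read directly. This is a minimal sketch using only the standard library and the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header). `read_safetensors_shapes` and the shard path in the usage comment are hypothetical, and it assumes the checkpoint is stored as safetensors.

```python
import json
import struct


def read_safetensors_shapes(path):
    """Return {tensor_name: shape} from a .safetensors header without loading any weights."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the JSON header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {
        name: tuple(info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }


# Usage (hypothetical local shard path):
# shapes = read_safetensors_shapes("/path/to/shard.safetensors")
# print(shapes["language_model.model.layers.0.self_attn.q_proj.weight"])
```

Comparing these shapes against the values in the model's config.json should show whether the config or the weights are the odd one out.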

P.S. I had to uninstall flash-attn; I assume it's just not supported, which would be worth adding to the docs.
