UNIT: Unifying Image and Text Recognition in One Vision Encoder

PyTorch implementation of "UNIT: Unifying Image and Text Recognition in One Vision Encoder", NeurIPS 2024.

Install

This project supports both NVIDIA GPUs and Ascend NPUs.

  • Dependencies & Environment

    • Python >= 3.9
    • NVIDIA GPU with CUDA >= 11.7
    • Ascend NPU (910B recommended) with CANN 8.0.0 RC1 and torch-npu == 2.1.0
  • Install the Python packages

pip install torch==2.1.0 
pip install timm==0.9.12 
pip install transformers==4.32.1
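
As a quick sanity check after installing, the following minimal sketch verifies the expected torch version and device availability (the commented lines are the Ascend variant and assume torch-npu is installed):

import torch

print(torch.__version__)           # expect 2.1.0
print(torch.cuda.is_available())   # True on a working NVIDIA setup with CUDA >= 11.7

### uncomment on an Ascend machine
# import torch_npu
# print(torch.npu.is_available())  # True once torch-npu registers the NPU backend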

Usage

import torch
from PIL import Image
from transformers import CLIPImageProcessor

from unit import UNITModel

### uncomment to use Ascend NPU
# import torch_npu
# from torch_npu.npu import amp 
# from torch_npu.contrib import transfer_to_npu

# use UNIT_600M model
model_path = "/path/to/UNIT_600M/"
### uncomment to use UNIT_1B model
# model_path = "/path/to/UNIT_1B/"

model = UNITModel.from_pretrained(model_path)

model.to(device='cuda', dtype=torch.bfloat16)  # cast to bfloat16 to match the input tensor below
model.eval()

image_processor = CLIPImageProcessor.from_pretrained(model_path)

image = Image.open("test.jpg").convert('RGB')

image_input = image_processor(image)['pixel_values'][0]
image_tensor = torch.tensor(image_input).unsqueeze(0).to(torch.bfloat16).cuda()

with torch.no_grad():
    cls_tokens, spatial_tokens = model(image_tensor)

### Note: Applying LayerNorm to these tokens before feeding them into an LLM is crucial.
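
Following the note above, here is a minimal sketch of how the tokens might be prepared for an LLM, continuing from the snippet above. The LayerNorm placement follows the note; the linear projector and llm_dim are illustrative assumptions, not part of the released API:

import torch.nn as nn

hidden_dim = spatial_tokens.shape[-1]   # encoder output width, taken from the tokens above
llm_dim = 4096                          # assumed embedding width of the target LLM

# LayerNorm over the channel dimension, as the note above recommends
norm = nn.LayerNorm(hidden_dim).to(device='cuda', dtype=torch.bfloat16)
# hypothetical projection into the LLM embedding space
proj = nn.Linear(hidden_dim, llm_dim).to(device='cuda', dtype=torch.bfloat16)

with torch.no_grad():
    visual_embeds = proj(norm(spatial_tokens))   # (batch, num_tokens, llm_dim)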

Results

  • MLLM downstream tasks

| Method    | GQA   | OKVQA | ChartQA | DocVQA | InfoVQA | OCRBench | POPE  | MME     | SEED-Image | MathVista |
|-----------|-------|-------|---------|--------|---------|----------|-------|---------|------------|-----------|
| CLIP-L    | 62.34 | 56.97 | 51.96   | 57.19  | 29.31   | 382      | 84.67 | 1503.60 | 69.79      | 42.7      |
| SigLIP    | 63.02 | 61.06 | 56.48   | 61.97  | 29.70   | 429      | 85.93 | 1489.37 | 71.63      | 44.2      |
| UNIT-600M | 63.89 | 61.52 | 61.0    | 65.49  | 31.92   | 480      | 85.81 | 1529.76 | 72.81      | 44.6      |
| UNIT-1B   | 64.90 | 56.78 | 66.64   | 71.34  | 34.81   | 540      | 87.54 | 1531.92 | 73.15      | 44.3      |

Citation

If you use the code in your research, please cite:

@INPROCEEDINGS{Zhu2024UNIT,
    author = {Zhu, Yi and Zhou, Yanpeng and Wang, Chunwei and Cao, Yang and Han, Jianhua and Hou, Lu and Xu, Hang},
    title = {UNIT: Unifying Image and Text Recognition in One Vision Encoder},
    booktitle = {The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},
    year = {2024}
}
