Google has released PaliGemma, an open vision-language model that combines the SigLIP vision model and Gemma language model.
PaliGemma is designed for transfer learning to a wide range of vision-language tasks like image/video captioning, visual question answering, object detection/segmentation, and text reading.
It is a 3B-parameter model (the SigLIP vision encoder plus the Gemma 2B language model) and supports input images at resolutions up to 896 x 896, enabling fine-grained tasks such as document understanding.
The release includes models fine-tuned on various downstream tasks as well as code to use and fine-tune PaliGemma.
However, PaliGemma is an experimental research model, so caution is advised when building applications on top of it.