Google has released PaliGemma, an open vision-language model that combines the SigLIP vision model and Gemma language model.
PaliGemma is designed for transfer learning to a wide range of vision-language tasks like image/video captioning, visual question answering, object detection/segmentation, and text reading.
It is a 3B-parameter model (the SigLIP vision encoder plus the Gemma 2B language model) and supports input images at resolutions up to 896 x 896, enabling fine-grained tasks such as document understanding.
The release includes models fine-tuned on various downstream tasks as well as code to use and fine-tune PaliGemma.
However, PaliGemma is an experimental research model, so caution is advised when building applications on top of it.