    LLaVA (Large Language-and-Vision Assistant) is an end-to-end trained large multimodal model that combines a vision encoder and the Vicuna large language model to enable general-purpose visual and language understanding.
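    The connection between the two components can be sketched as a learned projection that maps vision-encoder patch features into the language model's token-embedding space, so image patches become "visual tokens" prepended to the text sequence. The dimensions and names below are illustrative assumptions, not the exact LLaVA configuration:

```python
import numpy as np

# Illustrative dimensions: ~1024-d vision-encoder patch features projected
# into a ~4096-d LLM embedding space. A random matrix stands in for the
# trainable projection layer.
VISION_DIM, LLM_DIM = 1024, 4096

rng = np.random.default_rng(0)
W = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02  # trainable projection

def project_image_features(patch_features: np.ndarray) -> np.ndarray:
    """Map vision-encoder patch features into the LLM embedding space."""
    return patch_features @ W

# 256 image patches become 256 "visual tokens", concatenated before the
# text-token embeddings to form the multimodal input sequence.
patches = rng.standard_normal((256, VISION_DIM))
visual_tokens = project_image_features(patches)
text_tokens = rng.standard_normal((32, LLM_DIM))
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (288, 4096)
```

    Once projected, the language model treats visual tokens and text tokens uniformly, which is what lets a single decoder reason jointly over image and instruction.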

    Trained on 158,000 unique GPT-4-generated language-image instruction-following samples, LLaVA achieves impressive chat capabilities, at times mimicking the behaviours of multimodal GPT-4 on unseen images and instructions.

    LLaVA utilises a two-stage instruction-tuning procedure: a first stage that aligns visual features with the language model's embedding space, followed by a second stage that fine-tunes the model end-to-end for visual chat and science question answering. Early experiments show LLaVA achieves an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset.
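    A common way to organise such a schedule is to toggle which components are trainable per stage: in the alignment stage only the projection layer is updated, while in the end-to-end stage the language model is unfrozen as well. The component and stage names below are a hedged sketch, not LLaVA's actual configuration keys:

```python
# Hypothetical two-stage training schedule: True means the component's
# weights are updated in that stage, False means it stays frozen.
STAGES = {
    "stage1_feature_alignment": {
        "vision_encoder": False,   # frozen
        "projection": True,        # only the projection layer is trained
        "language_model": False,   # frozen
    },
    "stage2_end_to_end_finetune": {
        "vision_encoder": False,   # typically kept frozen
        "projection": True,
        "language_model": True,    # LLM weights are updated too
    },
}

def trainable_components(stage: str) -> list[str]:
    """Return the components whose weights are updated in a given stage."""
    return [name for name, trainable in STAGES[stage].items() if trainable]

print(trainable_components("stage1_feature_alignment"))  # ['projection']
print(trainable_components("stage2_end_to_end_finetune"))
```

    Keeping the vision encoder frozen throughout reduces compute and preserves the pretrained visual representations; only the lighter projection and the language model need gradient updates.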

    The developers have open-sourced the GPT-4-generated visual instruction-tuning data, the model, and the code base to support further research. LLaVA demonstrates the potential for large multimodal models to enable powerful visual-language understanding and reasoning capabilities.