
    Apple's MM1 is a multimodal large language model (MLLM), developed by a team of computer scientists and engineers at Apple, that can interpret both image and text data.

    MM1 is a family of multimodal models designed to improve capabilities such as image captioning, visual question answering, and query learning by integrating text and image data. The family includes both dense models and mixture-of-experts (MoE) variants; its largest member has 30 billion parameters yet beats many 80B open-source LLMs on visual tasks.
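Apple has not released MM1's architecture as code, but the mixture-of-experts idea the family uses is general: a small router scores a set of expert sub-networks per token, activates only the top-k of them, and mixes their outputs by the routing weights. A minimal NumPy sketch (all layer sizes and the linear experts are illustrative assumptions, not MM1's actual design):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    """Toy mixture-of-experts layer: a learned router picks the top-k
    experts per token and mixes their outputs by routing weight."""

    def __init__(self, d_model, n_experts, top_k=2):
        self.top_k = top_k
        # router: projects each token to one score per expert
        self.router = rng.normal(size=(d_model, n_experts)) * 0.02
        # each expert is a simple linear map, purely for illustration
        self.experts = [rng.normal(size=(d_model, d_model)) * 0.02
                        for _ in range(n_experts)]

    def forward(self, x):                        # x: (tokens, d_model)
        scores = softmax(x @ self.router)        # (tokens, n_experts)
        top = np.argsort(scores, axis=-1)[:, -self.top_k:]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            w = scores[t, top[t]]
            w = w / w.sum()                      # renormalize the top-k weights
            for weight, e in zip(w, top[t]):
                out[t] += weight * (x[t] @ self.experts[e])
        return out

layer = MoELayer(d_model=8, n_experts=4, top_k=2)
tokens = rng.normal(size=(3, 8))
out = layer.forward(tokens)
print(out.shape)
```

The appeal of this design is that parameter count grows with the number of experts while per-token compute stays roughly constant, since only k experts run for any token.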

    The MM1 model can count objects, identify the objects that appear in an image, and apply common-sense knowledge about everyday objects to give users useful information about what the image shows.

    It also supports in-context learning: rather than starting from scratch with every question, it can pick up a task from a handful of worked examples supplied in the prompt, and it carries what it has seen earlier in the current conversation into later answers.
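    MM1 has no public API, but in-context (few-shot) learning in general amounts to assembling worked examples ahead of the new query so the model can infer the pattern without any weight updates. A hypothetical sketch of that prompt construction (the helper name and the image-question examples are invented for illustration):

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt: worked Q/A examples precede the
    new question, so the model picks up the task in-context."""
    parts = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    parts.append(f"Q: {query}\nA:")          # leave the final answer open
    return "\n\n".join(parts)

examples = [
    ("How many apples are on the table in image 1?", "Three."),
    ("What colour is the mug in image 2?", "Blue."),
]
prompt = build_few_shot_prompt(examples, "How many chairs are in image 3?")
print(prompt)
```

    The same idea extends to multimodal prompts, where each example interleaves an image with its question and answer.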