OpenAI has announced GPT-4o, a new multimodal AI model that reasons across text, audio, and visual inputs and generates text, audio, or image outputs in real time.

GPT-4o matches GPT-4 Turbo's performance on English text and code while providing major improvements in multilingual understanding, audio processing with latencies as low as 232 ms, and vision capabilities, all within a single model.

It achieves state-of-the-art results on multimodal benchmarks while being 50% cheaper than GPT-4 Turbo in the API.
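
For developers, the model is exposed through the existing chat completions endpoint. Below is a minimal sketch of a text-only call using the OpenAI Python SDK; it assumes the `openai` package is installed, an `OPENAI_API_KEY` environment variable is set, and the model identifier is `gpt-4o`.

```python
# Minimal sketch: calling GPT-4o through the OpenAI Python SDK.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # model identifier OpenAI lists for GPT-4o
    messages=[
        {"role": "user", "content": "Summarize GPT-4o's new capabilities in one sentence."}
    ],
)
print(response.choices[0].message.content)
```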

GPT-4o represents a step towards more natural human-AI interaction by seamlessly integrating multiple modalities. Initial demos showcase abilities including real-time translation, multimodal dialogue, audio generation such as singing, and visual understanding.
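
The visual understanding shown in the demos is reachable through the same chat completions interface, where image and text inputs can be combined in a single message. The sketch below assumes the standard image-input message format; the image URL is a hypothetical placeholder.

```python
# Hedged sketch: mixing text and an image in one GPT-4o request.
# The image URL below is a placeholder, not a real asset.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```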