Meet Leopard: The AI That Excels in Multi-Image, Text-Rich Tasks

2024/11/6

The Quantum Drift

In this episode, Robert and Haley explore a recent breakthrough in multimodal AI models: Leopard, a new model developed to tackle complex, text-rich image tasks. Designed by researchers from the University of Notre Dame, Tencent AI Seattle Lab, and UIUC, Leopard is the first model to truly excel at understanding and reasoning across multiple text-heavy images, like presentation slides, web snapshots, and scanned documents.

Join us as we break down how Leopard’s adaptive high-resolution multi-image encoding and innovative pixel shuffling set it apart from traditional models. Unlike its predecessors, Leopard can keep high-resolution details without sacrificing accuracy, meaning it’s primed for real-world uses like analyzing multi-page reports, data charts, and visual presentations. We discuss:

  • Leopard’s Unique Dataset: A tailored instruction-tuning dataset of over a million data points.
  • Dynamic Encoding: How Leopard keeps crucial details while managing multiple images at once.
  • Performance Gains: Over 9% improvement on benchmarks like SlideVQA and Multi-page DocVQA.
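
For listeners who want a concrete picture of the pixel shuffling mentioned above, here is a minimal Python sketch. It assumes the operation merges each 2x2 block of neighbouring visual feature vectors into one longer vector, so the number of visual tokens drops by a factor of four while every value is kept. The function name, the nested-list grid layout, and the block size are illustrative assumptions, not Leopard's actual code.

```python
def pixel_shuffle_compress(grid, r=2):
    """Merge each r x r block of neighbouring feature vectors into one longer vector.

    `grid` is a list of rows; each cell is a list of floats (one visual token).
    Hypothetical helper for illustration only.
    """
    height, width = len(grid), len(grid[0])
    if height % r or width % r:
        raise ValueError("grid height and width must be divisible by r")
    compressed = []
    for top in range(0, height, r):
        row = []
        for left in range(0, width, r):
            merged = []
            # Concatenate the r * r feature vectors of this block, row by row.
            for dy in range(r):
                for dx in range(r):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        compressed.append(row)
    return compressed


# A toy 4 x 4 grid of 3-dimensional features (values are made up).
grid = [[[float(y), float(x), 1.0] for x in range(4)] for y in range(4)]
small = pixel_shuffle_compress(grid, r=2)

tokens_before = len(grid) * len(grid[0])
tokens_after = len(small) * len(small[0])
print(tokens_before, tokens_after, len(small[0][0]))
```

In this toy run the grid shrinks from 16 tokens to 4, and each remaining token grows from 3 to 12 values, so nothing is discarded; the sequence just gets shorter and wider.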

Get ready to dive into how this model reshapes the landscape for AI in business, education, and research. Leopard just might be the game-changer multimodal AI has been waiting for!