Research · llm · Official
Analyzed: Dec 24, 2025 11:49

Google's ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Published: Mar 19, 2024 20:15
1 min read
Google Research

Analysis

This article introduces ScreenAI, a vision-language model designed to understand and interact with user interfaces (UIs) and infographics. The model builds on the PaLI architecture and incorporates a flexible patching strategy. A key innovation is the Screen Annotation task, in which the model identifies UI elements and generates textual screen descriptions that can be used to build training data with large language models (LLMs). ScreenAI achieves state-of-the-art results on a range of UI- and infographic-based tasks, answering questions, navigating UIs, and summarizing on-screen information. Its relatively small size (5B parameters) combined with strong performance makes it a promising approach to building efficient and effective visual language models for human-machine interaction.
Reference

ScreenAI improves upon the PaLI architecture by adopting the flexible patching strategy introduced in pix2struct.
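To illustrate the idea behind pix2struct-style flexible patching, here is a minimal sketch (not ScreenAI's actual implementation; the function name, patch size, and budget are illustrative assumptions): instead of resizing every image to a fixed square, the image is scaled so that an integer grid of fixed-size patches preserves its original aspect ratio while staying within a patch budget.

```python
import math

def flexible_patch_grid(height, width, patch_size=16, max_patches=1024):
    """Illustrative sketch of pix2struct-style flexible patching:
    choose a (rows, cols) patch grid that approximately preserves the
    image's aspect ratio and uses at most `max_patches` patches."""
    # Largest uniform scale such that the resulting patch grid
    # (rows * cols) fits within the patch budget.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols

# A wide screenshot yields more columns than rows, so UI layout
# is not distorted by forcing a square input resolution.
rows, cols = flexible_patch_grid(1080, 1920)
```

The key design point is that a wide desktop screenshot and a tall mobile screenshot get differently shaped grids from the same patch budget, which matters for UIs whose aspect ratios vary widely.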