Semantic Image Disassembler (SID): A VLM-Based Tool for Image Manipulation
Analysis
The Semantic Image Disassembler (SID) is presented as a versatile tool leveraging Vision Language Models (VLMs) for image manipulation tasks. Its core functionality revolves around disassembling images into semantic components, separating content (wireframe/skeleton) from style (visual physics). This structured approach, using JSON for analysis, enables various processing modes without redundant re-interpretation. The tool supports both image and text inputs, offering functionalities like style DNA extraction, full prompt extraction, and de-summarization. Its model-agnostic design, tested with Qwen3-VL and Gemma 3, enhances its adaptability. The ability to extract reusable visual physics and reconstruct generation-ready prompts makes SID a potentially valuable asset for image editing and generation workflows, especially within the Stable Diffusion ecosystem.
Key Takeaways
- •SID is a VLM-based tool for image manipulation.
- •It separates image content from style using JSON.
- •It supports style DNA extraction, prompt extraction, and de-summarization.
“SID analyzes inputs using a structured analysis stage that separates content (wireframe / skeleton) from style (visual physics) in JSON form.”