Zero Width Characters (U+200B) in LLM Output
Analysis
This post on Reddit's r/artificial highlights a practical issue encountered when using Perplexity AI: zero-width characters, which show up as small square symbols, appear in the generated text. The user wants to know where these characters come from and speculates about Unicode normalization, invisible markup, or model tagging mechanisms. The question matters because the characters reduce the usability of LLM-generated text, particularly when it is exported to rich text editors like Word. The post asks the community what these characters are and what the best practices are for cleaning or sanitizing the text to remove them. Other users who move LLM output into text editors are likely to run into the same problem.
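Before deciding how to clean the text, it can help to see exactly which invisible characters are present. The following is a minimal sketch, not something taken from the post: the function name `find_invisible` is hypothetical, and it assumes that checking for the Unicode general category `Cf` (format characters, which includes U+200B) is a reasonable way to catch the characters in question.

```python
import unicodedata


def find_invisible(text):
    """Return (index, code point, Unicode name) for each format (Cf) character.

    Assumption: the unwanted characters belong to general category Cf,
    which covers U+200B ZERO WIDTH SPACE and similar invisible characters.
    """
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits


# Hypothetical sample standing in for text copied from an LLM response.
sample = "Zero\u200bwidth\u200b example"
for hit in find_invisible(sample):
    print(hit)
```

Running this on text pasted from the tool would show whether the squares are really U+200B or some other code point, which narrows down the likely cause.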
Key Takeaways
- LLMs can introduce unexpected characters into generated text.
- Zero-width characters can cause formatting issues in text editors.
- Cleaning and sanitizing generated text is crucial for usability.
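The cleaning step mentioned above could be sketched as follows. This is one possible approach under stated assumptions, not a method described in the post: the set `ZERO_WIDTH` and the function `strip_zero_width` are hypothetical names, and the list of code points is a guess at what is commonly unwanted. Note that U+200C and U+200D carry meaning in some scripts and in emoji sequences, so removing them blindly may change such text.

```python
# Assumed set of zero-width code points to remove; adjust for your text.
ZERO_WIDTH = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER (meaningful in some scripts)
    "\u200d",  # ZERO WIDTH JOINER (used in emoji sequences)
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE / byte order mark
}

# str.translate deletes every character whose code point maps to None.
_DELETE_TABLE = {ord(ch): None for ch in ZERO_WIDTH}


def strip_zero_width(text):
    """Return text with the assumed zero-width characters removed."""
    return text.translate(_DELETE_TABLE)


cleaned = strip_zero_width("LLM\u200b output\ufeff pasted into Word")
print(cleaned)
```

A fixed deletion table keeps the behavior predictable: only the listed code points are removed, and all other characters, including legitimate non-ASCII text, pass through unchanged.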
“I observed numerous small square symbols (⧈) embedded within the generated text. I’m trying to determine whether these characters correspond to hidden control tokens, or metadata artifacts introduced during text generation or encoding.”