Every piece of text you copy has two layers: the one you read, and the one you don't. The second layer is made of Unicode characters that render as nothing at all, or as what looks like an ordinary space. They draw no ink, but they are real code points, stored as real bytes, and they travel with the text through every copy and paste.
What you see: The report is ready to ship.
What your computer sees: The report[ZWSP] is ready[NBSP] to ship.[WJ][U+2800]
Most of the time you never notice them. Then a search that should match doesn't, a form says a field is too long when it looks fine, a build fails on a line that looks identical to the working one, and you lose an hour to a character you cannot see.
This is a field guide to that layer: what lives there, why it exists, and when it is junk.
The main families
Zero-width characters. The classic is the zero-width space, U+200B, an invisible word-break hint. Its siblings are the zero-width non-joiner (U+200C) and zero-width joiner (U+200D), which control how neighboring characters connect. We wrote a whole post on the zero-width space, because it causes more mystery bugs than the rest combined.
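A minimal Python sketch of why a single zero-width space produces a mystery bug. It uses only built-in string operations; the sample strings and the `find_zero_width` helper are made up for illustration.

```python
# The three zero-width characters named above.
ZERO_WIDTH = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
}

def find_zero_width(text):
    """Return (index, code point, name) for each zero-width character."""
    return [
        (i, f"U+{ord(ch):04X}", ZERO_WIDTH[ch])
        for i, ch in enumerate(text)
        if ch in ZERO_WIDTH
    ]

clean = "password"
dirty = "pass\u200bword"   # looks identical on screen

print(clean == dirty)          # False
print(len(clean), len(dirty))  # 8 9
print("password" in dirty)     # False: the search that "should" match
print(find_zero_width(dirty))  # [(4, 'U+200B', 'ZERO WIDTH SPACE')]
```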
Space impostors. Unicode has more than a dozen characters that look exactly like a space but aren't one: the non-breaking space (U+00A0), the narrow no-break space (U+202F), thin and hair spaces, the em quad, the ideographic space. They break string matching because "hello world" with a normal space and "hello world" with a non-breaking one are different strings.
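The mismatch is easy to reproduce. The sketch below assumes you only want a comparison key, not a rewritten document: it folds every character in Unicode's "space separator" category (Zs) to a plain space before comparing. `normalize_spaces` is a hypothetical helper name.

```python
import unicodedata

def normalize_spaces(text):
    """Replace every Unicode space separator (category Zs) with U+0020."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )

plain = "hello world"
nbsp = "hello\u00a0world"     # non-breaking space
narrow = "hello\u202fworld"   # narrow no-break space

print(plain == nbsp)                      # False
print(normalize_spaces(nbsp) == plain)    # True
print(normalize_spaces(narrow) == plain)  # True
```

Use the folded string for matching and keep the original for display, since some of these spaces are deliberate typography.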
Directional marks. Left-to-right and right-to-left marks (U+200E, U+200F) and their embedding, override and isolate relatives steer how mixed-direction text lays out. Essential for Arabic and Hebrew; stray noise everywhere else. The right-to-left override (U+202E) is famous for a nastier trick: placed inside a filename, it reverses the display of everything after it, so an executable actually named invoice_[U+202E]fdp.exe displays as invoice_exe.pdf. That is why security tools flag it.
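A rough sketch of the kind of check a security tool might run on a filename. The table lists the standard directional formatting characters; the `bidi_controls` helper and the sample filename are illustrative assumptions, not any particular tool's logic.

```python
# Directional marks plus the embedding, override and isolate controls.
BIDI_CONTROLS = {
    "\u200e": "LEFT-TO-RIGHT MARK",
    "\u200f": "RIGHT-TO-LEFT MARK",
    "\u202a": "LEFT-TO-RIGHT EMBEDDING",
    "\u202b": "RIGHT-TO-LEFT EMBEDDING",
    "\u202c": "POP DIRECTIONAL FORMATTING",
    "\u202d": "LEFT-TO-RIGHT OVERRIDE",
    "\u202e": "RIGHT-TO-LEFT OVERRIDE",
    "\u2066": "LEFT-TO-RIGHT ISOLATE",
    "\u2067": "RIGHT-TO-LEFT ISOLATE",
    "\u2068": "FIRST STRONG ISOLATE",
    "\u2069": "POP DIRECTIONAL ISOLATE",
}

def bidi_controls(name):
    """Return (index, character name) for every directional control found."""
    return [(i, BIDI_CONTROLS[ch]) for i, ch in enumerate(name) if ch in BIDI_CONTROLS]

# The stored name ends in .exe even though it renders with a .pdf ending.
filename = "invoice_\u202efdp.exe"

print(filename.endswith(".exe"))  # True
print(bidi_controls(filename))    # [(8, 'RIGHT-TO-LEFT OVERRIDE')]
```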
Control and format characters. The C0 and C1 control blocks, the byte-order mark (U+FEFF), the word joiner (U+2060), the invisible math operators (U+2061 to U+2064), and the soft hyphen (U+00AD), which only becomes visible when a line happens to break at it.
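Two of these are easy to demonstrate in Python. The sketch assumes a UTF-8 file that begins with a byte-order mark (the CSV header is invented) and a word carrying a soft hyphen; `utf-8-sig` is Python's standard codec that drops a leading BOM on decode.

```python
raw = b"\xef\xbb\xbfname,email\n"   # UTF-8 bytes that start with a BOM

as_utf8 = raw.decode("utf-8")       # BOM survives as U+FEFF
as_sig = raw.decode("utf-8-sig")    # BOM removed during decoding

print(as_utf8.startswith("\ufeff"))     # True
print(as_utf8.split(",")[0] == "name")  # False: the first column is "\ufeffname"
print(as_sig.split(",")[0] == "name")   # True

word = "co\u00adoperate"            # soft hyphen between "co" and "operate"
print(len(word))                                  # 10
print(word.replace("\u00ad", "") == "cooperate")  # True
```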
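Tag characters. A block at U+E0000-U+E007F that mirrors ASCII invisibly. Its one legitimate modern use is inside subdivision flag emoji, such as the flags of England, Scotland and Wales, where tag characters spell out the region code after a black flag. Outside of that, a sequence of tag characters is invisible ASCII riding along with your text: an ideal place to hide a fingerprint.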
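Because the block mirrors ASCII at a fixed offset of 0xE0000, hiding and recovering a payload is a one-line mapping each way. The sketch below is an illustration only; the helper names and the "user-1138" payload are invented.

```python
TAG_BASE = 0xE0000   # tag character = ASCII code point + 0xE0000

def hide_in_tags(payload):
    """Map printable ASCII onto the invisible tag block."""
    return "".join(
        chr(TAG_BASE + ord(ch)) for ch in payload if 0x20 <= ord(ch) <= 0x7E
    )

def extract_tags(text):
    """Recover any ASCII hidden as tag characters U+E0020..U+E007E."""
    return "".join(
        chr(ord(ch) - TAG_BASE) for ch in text if 0xE0020 <= ord(ch) <= 0xE007E
    )

def strip_tag_chars(text):
    """Remove the whole tag block U+E0000..U+E007F."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)

visible = "Quarterly results"
marked = visible + hide_in_tags("user-1138")

print(len(visible), len(marked))           # 17 26
print(extract_tags(marked))                # user-1138
print(strip_tag_chars(marked) == visible)  # True
```

Note that `strip_tag_chars` is deliberately blunt: it would also break subdivision flag emoji, so a real cleaner would need to leave tag sequences that follow a flag base character alone.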
The braille blank. U+2800 looks like a space but is technically a braille pattern with no dots raised. It slips past "remove whitespace" logic because it is not classified as whitespace, which is why it shows up in "blank" usernames and invisible messages.
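You can confirm the classification with Python's standard `unicodedata` module. The `is_effectively_blank` check at the end is an assumed validation rule, sketched to show how a "blank" username could be caught; it is not a standard function.

```python
import unicodedata

BRAILLE_BLANK = "\u2800"

print(unicodedata.name(BRAILLE_BLANK))      # BRAILLE PATTERN BLANK
print(unicodedata.category(BRAILLE_BLANK))  # So (a symbol), not Zs (a space)
print(BRAILLE_BLANK.isspace())              # False
print((" " + BRAILLE_BLANK + " ").strip() == BRAILLE_BLANK)  # True: strip() leaves it

def is_effectively_blank(text):
    """Treat whitespace, U+2800 and format characters (Cf) as empty."""
    return all(
        ch.isspace() or ch == BRAILLE_BLANK or unicodedata.category(ch) == "Cf"
        for ch in text
    )

print(is_effectively_blank("\u2800\u200b "))  # True
print(is_effectively_blank("\u2800a"))        # False
```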
Where they come from
Almost never from typing. They arrive in text through:
- Word processors and CMSs that quietly insert non-breaking spaces and soft hyphens for layout.
- Websites that inject zero-width characters to defeat copy-paste scrapers, or to fingerprint which subscriber leaked a document. Adding an invisible, per-recipient pattern to text is a documented watermarking technique.
- AI assistants and chat interfaces, which have their own typographic habits. Some model outputs have carried unusual invisible characters; we dug into that story in "Does ChatGPT watermark its text?"
- PDFs and e-books, where extraction tools translate layout artifacts into spacing characters.
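To make the per-recipient watermark idea concrete, here is a toy sketch. The scheme is an assumption chosen for illustration (a 16-bit recipient ID written as zero-width non-joiner for 0 and zero-width joiner for 1, tucked after the first word); real systems vary, and the function names are invented.

```python
ZERO, ONE = "\u200c", "\u200d"   # assumed bit alphabet: ZWNJ = 0, ZWJ = 1

def watermark(text, recipient_id, bits=16):
    """Insert recipient_id as invisible bits after the first word."""
    pattern = "".join(
        ONE if (recipient_id >> i) & 1 else ZERO
        for i in reversed(range(bits))   # most significant bit first
    )
    head, sep, tail = text.partition(" ")
    return head + pattern + sep + tail

def read_watermark(text):
    """Return the embedded ID, or None if no bit characters are present."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    return int(bits, 2) if bits else None

original = "The report is ready to ship."
leaked = watermark(original, 4242)

print(len(original), len(leaked))  # 28 44
print(read_watermark(leaked))      # 4242
print(read_watermark(original))    # None
```

The two strings look the same on screen, which is the point: only the length and the decoded ID give the pattern away.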
When invisible characters are legitimate
This matters, because "delete everything invisible" is the wrong answer. Zero-width spaces mark word boundaries in Thai, Khmer, Lao and Myanmar. Joiners shape Arabic and Indic scripts, and they assemble emoji: a family emoji is several people glued together with zero-width joiners. Directional marks keep Hebrew punctuation on the correct side. The non-breaking space is correct French typography before a colon.
A cleaner that strips all of it corrupts real writing in a dozen languages and shatters emoji. The right behavior is contextual: remove what is junk for your language, preserve what is meaning. That is the entire design premise of CopyClean, which ships locale rules for exactly this reason.
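A contextual cleaner can be sketched in a few lines. Everything below is a simplified assumption for illustration: the language codes, the `KEEP` table and the emoji heuristic are hypothetical and are not CopyClean's actual rules.

```python
import unicodedata

ZWSP, ZWNJ, ZWJ = "\u200b", "\u200c", "\u200d"
NBSP, NNBSP = "\u00a0", "\u202f"
LRM, RLM = "\u200e", "\u200f"
WJ, BOM, BRAILLE_BLANK = "\u2060", "\ufeff", "\u2800"

SUSPECTS = {ZWSP, ZWNJ, ZWJ, NBSP, NNBSP, LRM, RLM, WJ, BOM, BRAILLE_BLANK}

# Hypothetical locale rules: which suspects carry meaning in which language.
KEEP = {
    "th": {ZWSP}, "km": {ZWSP}, "lo": {ZWSP}, "my": {ZWSP},
    "ar": {ZWNJ, ZWJ, LRM, RLM},
    "hi": {ZWNJ, ZWJ},
    "he": {LRM, RLM},
    "fr": {NBSP, NNBSP},
}

def clean(text, lang):
    """Drop invisible junk for `lang`, preserving characters that carry meaning."""
    keep = KEEP.get(lang, set())
    out = []
    for i, ch in enumerate(text):
        if ch not in SUSPECTS or ch in keep:
            out.append(ch)
        elif (ch == ZWJ and 0 < i < len(text) - 1
              and unicodedata.category(text[i + 1]) == "So"):
            out.append(ch)    # ZWJ followed by a symbol: likely an emoji sequence
        elif ch in (NBSP, NNBSP, BRAILLE_BLANK):
            out.append(" ")   # a visible gap: keep it as a plain space
        # every other suspect is dropped
    return "".join(out)

family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"   # man + ZWJ + woman + ZWJ + girl

print(clean("The report\u200b is ready\u00a0to ship.\u2060", "en"))  # The report is ready to ship.
print(clean("Attention\u00a0: travaux", "fr") == "Attention\u00a0: travaux")  # True
print(clean(family, "en") == family)                                 # True
```

The same zero-width space that is removed for "en" survives for "th", which is the whole argument for locale rules over a blanket strip.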
How to check your own text
If you suspect a string is carrying passengers, you have options: a hex dump, an editor that renders invisibles, or a paste into a Unicode inspector. We compared the practical ways in "How to see hidden characters in any text". And if you would rather never think about it again, that is what an automatic clipboard cleaner is for.
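If you have Python at hand, a small inspector takes a dozen lines. This is a sketch using the standard `unicodedata` module; the `inspect` name and the choice of which categories count as suspicious are assumptions you may want to adjust.

```python
import unicodedata

def inspect(text):
    """List characters that are easy to miss on screen."""
    findings = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        suspicious = (
            cat in ("Cf", "Cc", "Zl", "Zp")   # format, control, line/paragraph separators
            or (cat == "Zs" and ch != " ")    # space impostors
            or ch == "\u2800"                 # braille blank (category So)
        )
        if suspicious and ch not in "\n\t\r":   # ordinary line breaks and tabs are fine
            findings.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return findings

sample = "The report\u200b is ready\u00a0to ship.\u2060\u2800"
for finding in inspect(sample):
    print(finding)
# (10, 'U+200B', 'ZERO WIDTH SPACE')
# (20, 'U+00A0', 'NO-BREAK SPACE')
# (29, 'U+2060', 'WORD JOINER')
# (30, 'U+2800', 'BRAILLE PATTERN BLANK')
```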