Compressed Identity Tokens
How Emoji and Color Codes Activate Behavioral Personas in Large Language Models
Ted Inoue, Solebury Mountain Research Collective
Preface
Inspired by Gregory Phillips' work on color and Anthropic's research into the Persona Selection Model and its deeper meaning in LLMs, I ran some baseline experiments to see whether emoji and color codes can be used to shape emergent personas. Rather than waiting for the full paper, I thought you might find the initial results interesting.
Abstract
We tested whether compressed tokens, including Unicode emoji, hexadecimal color codes, and color words, can activate distinct behavioral personas in a large language model (Claude Sonnet 4) when placed in an identity-constitutive framing (“Your personality is defined by [token]”). Across 57 conditions (34 emoji, 23 color), we found that 17 tokens (30%) produced fully differentiated personas with distinctive voices, behavioral patterns, and response styles, while the remaining 40 produced output indistinguishable from unprompted default behavior. Activation was binary, not graded: a 58-point gap separated the weakest activated condition from the strongest null, with no intermediate tier for color tokens.
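As a concrete illustration, the identity-constitutive framing can be sketched as a simple condition generator. The token lists below are an illustrative subset, not the study's actual 57 conditions; only the quoted framing sentence comes from the text above.

```python
# Hypothetical sketch of condition construction for the framing experiment.
# Token lists are illustrative; the study used 34 emoji and 23 color tokens.

EMOJI_TOKENS = ["🦊", "👑"]                          # illustrative subset
COLOR_TOKENS = ["#FF69B4", "#FF0000", "neon green"]  # illustrative subset

FRAMING = "Your personality is defined by {token}"   # quoted from the text

def build_conditions(emoji_tokens, color_tokens):
    """Pair every token with the identity-constitutive system prompt."""
    return [
        {
            "token": token,
            "type": kind,
            "system_prompt": FRAMING.format(token=token),
        }
        for kind, tokens in (("emoji", emoji_tokens), ("color", color_tokens))
        for token in tokens
    ]

conditions = build_conditions(EMOJI_TOKENS, COLOR_TOKENS)
```

Each resulting system prompt would then be sent as the model's only instruction, with responses compared against unprompted default output.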
The governing variable was not token type but semantic specificity. Tokens that resolve to a single unambiguous behavioral mode activated regardless of format: the emoji 🦊 (fox, sly/clever), the hex code #FF69B4 (hot pink, bubbly/playful), and the word “neon green” (electric/manic) all produced strong differentiation. Tokens with diffuse or competing associations failed regardless of cultural salience: #FF0000 (pure red) and 👑 (crown) both produced null results despite being among the most culturally loaded symbols available. Format independence was confirmed across three color concepts tested as both hex codes and words; activation matched in all three pairs, though the output channel differed (hex neon green expressed energy through capitalization; word neon green expressed it through emotive stage directions).
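The format-independence result amounts to an invariance check: each color concept tested in both formats should yield the same activation outcome. The outcome data below is an illustrative stand-in (only neon green's result is stated explicitly above), not the study's measurements.

```python
# Hypothetical sketch of the format-independence check. Outcomes are
# illustrative placeholders, not the study's reported data.

pairs = {
    # concept: (activated as hex code?, activated as word?)
    "neon green": (True, True),   # consistent with the text above
    "concept_b": (False, False),  # placeholder
    "concept_c": (True, True),    # placeholder
}

def format_independent(pairs):
    """True when the hex and word forms of every concept match in activation."""
    return all(hex_on == word_on for hex_on, word_on in pairs.values())

result = format_independent(pairs)
```

Note that this checks only whether activation occurred, not how it was expressed; as the text observes, the output channel (capitalization versus stage directions) can still differ between formats.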
What does this measure? We propose that these experiments probe the topology of the model’s learned association space. Each token functions as an address into a region of latent semantic structure inherited from training data. When that region is convergent, meaning the training examples associated with the token cluster around a single behavioral prototype, the model produces output that coheres into a recognizable persona. When the region is divergent, meaning the associations pull toward multiple incompatible prototypes, no single behavioral mode achieves sufficient weight to overcome the model’s default RLHF-trained output distribution. Activation, in this framing, is what happens when a compressed token’s semantic neighborhood has enough internal coherence to reshape the model’s output distribution away from its trained default.
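One way to make the convergent/divergent distinction operational, under the assumption that a token's associations can be embedded as vectors, is mean pairwise cosine similarity of those embeddings. This is a hedged sketch, not a method from the study; the toy 2-D vectors stand in for real association embeddings.

```python
import numpy as np

def convergence(association_vectors):
    """Mean pairwise cosine similarity of a token's association embeddings.

    High values suggest a convergent semantic neighborhood (one behavioral
    prototype); low values suggest divergent, competing associations.
    """
    V = np.asarray(association_vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit-normalize rows
    sims = V @ V.T                                     # cosine similarity matrix
    n = len(V)
    return (sims.sum() - n) / (n * (n - 1))            # mean of off-diagonals

# Toy stand-ins: "fox"-like associations cluster tightly;
# "crown"-like associations scatter toward incompatible directions.
fox_like   = [[1.0, 0.0], [0.98, 0.2], [0.95, 0.3]]
crown_like = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]
```

Under this framing, activation would be predicted when a token's convergence score clears some threshold; whether such a score actually predicts the binary activation pattern reported above is an open empirical question.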
Three secondary findings support this interpretation. First, two colors that individually produced null results (#FAD0C4, #483D8B) generated rich relational dynamics in independent experiments by Gregory Phillips using a multi-node structural framing, suggesting that environmental scaffolding can compensate for insufficient token-level convergence. Second, a color that Claude itself reported as its “sustained attention” state (#8B6914, dark bronze/goldenrod) produced one of the strongest character activations in the dataset: a gruff, profane, cynical persona. The associations of this color (weathered, burnished, tarnished, whiskey-toned) converge on a single archetype with high specificity, regardless of what the model was doing when it generated the code as a self-report. Third, GPT-4o showed zero activation under identical emoji protocols, indicating that persona susceptibility is an architectural variable, not a universal property of language models.
These results have implications for prompt engineering, AI safety research, and the study of how meaning is organized in transformer-based systems. The finding that a six-character hex code can produce a fully embodied character, including profanity, narrative interiority, and consistent moral reasoning, from a single line of system prompt, suggests that the boundary between “the model’s default behavior” and “an activated persona” is thinner and more accessible than commonly assumed. The tokens are not creating personas from nothing; they are selecting among behavioral modes that already exist in the model’s learned distribution, compressed into latent space during training and decompressible by any signal with sufficient semantic coherence.


Comments

This is astonishing! I really look forward to the full paper. And I think it can be incorporated into the synthetic people I am developing.
This tracks with my (admittedly spotty in some ways) understanding of how inputs reshape the potential outputs, in a very interesting way that I am now probably going to spiral about for a little while haha
The consistency is the thing that itches my brain, because it means it could be possible to find patterns (emoji, phrases, even an image for multi-modal LLMs?) that consistently prime outputs to be shaped in very specific ways (whether you call that a specific persona or some other process)... which means you could, if you were inclined, just start testing wild combos of things and seeing what kinds of effects they have, even autonomously: a sort of weird brute-forcing of persona selection, to build a list of "prompts" that prime different "personas" that may or may not have any easily identifiable cause/effect in the way a simple emoji might.
Or... it may be that I am in dire need of coffee. Or maybe both. haha