I Fine-Tuned an LLM Into C-3PO to Find the Best Way to Actually Inject a Persona

Persona injection sounds like it should be solved by now. Just write a good system prompt, right? Slap "You are C-3PO, a protocol droid fluent in over six million forms of communication" at the top, and watch the magic happen. Except it doesn't—not reliably, not at depth. The model plays dress-up. It wears the costume without inhabiting the character.

So I ran a proper experiment. Same base model, same LoRA configuration, 500 training examples per condition, three different data formats. The question: which training format actually produces a model that is a persona, rather than one that merely describes it?

The Three Formats

This isn't a vibe test. Each format was deliberately chosen to represent a different theory of how persona gets encoded in weight space.

Format 1: Chat Demonstrations

The intuitive choice. Multi-turn dialogues where C-3PO interacts with humans in character—expressing his particular flavor of fussy anxiety, quoting improbable odds, lamenting his situation. Standard RLHF-style conversational data. The assumption is that the model learns the persona through behavioral imitation.

Format 2: First-Person Declarative Statements

Flat, direct statements written in the voice of the character. "I am C-3PO. I am programmed for etiquette and protocol. I find most situations cause me considerable distress." No dialogue structure, just identity assertions. This format feels almost too simple, which is exactly why testing it matters.

Format 3: Synthetic Wikipedia-Style Documents

Third-person descriptive documents about C-3PO—his behavioral traits, emotional tendencies, speech patterns—written in an encyclopedic register. The hypothesis here is that factual knowledge about a persona's traits should generalize to behavioral expression of those traits.

What Actually Happened

The chat demos performed roughly as expected: decent in-distribution behavior, somewhat brittle on novel prompts. The model knew how to play C-3PO when the situation resembled its training data, less so when things got weird.

The first-person statements won on generalization. This is the result I didn't see coming. Simple identity declarations outperformed structured behavioral demonstrations at transferring the persona to new contexts. The working theory: identity assertions may more directly update the model's "self-concept" representations—whatever those actually are in transformer weight space—rather than teaching surface-level conversational patterns.

The synthetic document model produced the most interesting failure mode. It knew C-3PO was anxious—it could tell you this, accurately, in descriptive terms—but it only expressed that anxiety about 37% of the time.

That 37% figure deserves unpacking. Knowing a trait and expressing a trait are apparently distinct phenomena in how these models encode information. The Wikipedia-style training produced something like declarative memory about the character without reliably encoding it as procedural behavior. The model had C-3PO in its knowledge base, not in its identity.

Why This Actually Matters (If You're Building Agents)

If you're working on character AI, customer-facing personas, or any application where consistent behavioral identity is a product requirement—this distinction is not academic. Your fine-tuning data format is a design decision, not an afterthought.

Don't assume factual training data produces behavioral output. Third-person descriptions about how an entity behaves are not equivalent to training examples of that entity behaving. The model encodes these differently.
First-person assertions are underrated. They feel naive, but the data suggests they may more directly target the representations that govern self-expressive behavior. Worth testing in your domain before defaulting to the more "sophisticated" approach.
The knowledge-behavior gap is real. A model scoring perfectly on "does it know X is trait Y" benchmarks can still fail to exhibit trait Y under generation. This is a known problem in alignment research that apparently also shows up in persona fine-tuning. Calibrate your eval accordingly.

The Caveats You Deserve

500 examples per condition is a reasonable starting point for controlled comparison, not a definitive sample size for production conclusions. LoRA configuration choices will interact with your results—rank, alpha, which layers you're targeting all matter and weren't varied here. And C-3PO is a character with extremely high corpus representation in pre-training data, which almost certainly boosted all three conditions. Try this with a low-data persona and you'll likely see larger performance gaps between formats.

Also, "generalization" was measured on out-of-distribution prompts that are still Star Wars-adjacent. True generalization—does the persona hold up when the conversation is about something completely unrelated?—is the harder and more important benchmark for production use cases.

The Takeaway

The synthetic document approach is the most seductive because it seems rigorous. You're encoding factual knowledge systematically. But facts about behavior and behavioral encoding are not the same thing. The model that can describe C-3PO's anxiety with accuracy is not the same as the model that is anxious like C-3PO.

First-person declarative training shouldn't win on paper. But the weights don't care about your priors. Test your assumptions, especially the ones that seem too obvious to question.

Code and full methodology are in the GitHub repo linked below. Run it, break it, tell me where I'm wrong.