Peripheral Attention Engineering: Teaching LLMs What Humans Know to Look For

    ┌──────────────────────────────────┐
    │         QUERY:                   │
    │  "What's your return policy?"    │
    │                                  │
    │  Without context:                │
    │    "30 days with a receipt."     │
    │                                  │
    │  With PAE:                       │
    │    VIP 6yr, $84K lifetime        │
    │    Holiday season, 47% returns   │
    │    $12K recent order             │
    │    => "Let us waive that for     │
    │       you, we'll send a label."  │
    └──────────────────────────────────┘

    The answer should not be the same for everyone.

I've been in customer support for most of my career. Restaurants, marketing agencies where I was client-facing, a baby brand I ran solo -- I've handled the complaints, the returns, the "what's your policy?" questions from real humans with real situations. What you learn fast in that line of work is that the same question can have wildly different right answers depending on who's asking and what's happening around them.

Working in AI now, with a specialty in infrastructure and RAG pipelines, I'm acutely aware of how context slips through the cracks with machines. They simply don't know what they don't know.

This is the problem Peripheral Attention Engineering was built to solve.

The Return Policy That Wasn't Wrong But Was Wrong

Consider a customer asking "What's your return policy?" An LLM with the policy document answers correctly: "30 days with a receipt." Technically flawless. But a competent human agent sees more: the customer is a six-year VIP with a recent $12,000 order, it's December 28th in the middle of peak holiday chaos, and their return rate has spiked to 47% over the past 90 days. That agent gives a fundamentally different answer.

The literal policy is the same. The answer should not be.

Before I had the language for PAE, I would have just called this personalized care. Every experience is different, and every person's circumstances are different. To foster the best relationships, you have to consider those personal details. That's the work. The problem is that LLMs aren't built to do it.

What LLMs Are Missing

I don't think this is really a flaw in how LLMs are trained. AI is trained to do what it's expected to do -- find appropriate context and use it to predict the most correct possible sequence of words. It does that remarkably well.

The problem is that if we as humans want to deploy AI to deal with other humans, it needs to learn a new skill: considering the personal circumstances of the people communicating with it. The major chatbots sort of do this already, but by accident, because they have access to entire chat histories. Your company's support bot won't have that luxury. So teaching it to be mindful of peripheral context becomes essential.

The terminology matters here. Originally I called this "nuance," but that wasn't quite right. This context really is irrelevant to the literal query; it just isn't irrelevant to the situation as a whole. If you're zoomed into a problem, this information sits in the periphery. Calling it "additional context" conjures everything we already understand about RAG: make the retrieval chunks bigger, retrieve more of them, problem solved. But this is not that.

The Three Ingredients: Relevance, Reach, and Taste

The problem decomposes into three parts. Peripheral relevance -- useful context exists around the asker, the situation, the entities involved. Contextual reach -- the unprompted instinct to retrieve that context before answering. And taste -- the commitment to a specific framing and tone shaped by what you found.

Taste is the hardest one. This is where uniquely human concepts like humor live. AI is getting more advanced, but because it's a prediction machine, it will never truly act as a human would. It can't be unexpected or chaotic. It can't live in the space where the unknowable is embraced. That's the realm of taste, and I'm not convinced an LLM can be trained on it.

So I focused on contextual reach -- the simplest path and the most direct answer to the question I was asking: how can I get AI to handle human-to-business interactions with appropriate care?
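At the system level, contextual reach can be pictured as a lookup that always runs before the model is asked anything. The following is only a sketch of that ordering, not the method used in the experiments: `answer_with_reach`, `fetch_peripheral`, and `generate` are hypothetical names, and the two callables stand in for whatever customer store and model client you already have.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical shape for peripheral context: a label mapped to a list of facts.
Slots = Mapping[str, Sequence[str]]


def answer_with_reach(
    query: str,
    asker_id: str,
    fetch_peripheral: Callable[[str], Slots],
    generate: Callable[[str], str],
) -> str:
    """Fetch asker-centred context first, then answer with it in the prompt."""
    # The "reach" step: it runs unprompted, whatever the query says.
    slots = fetch_peripheral(asker_id)
    context = "\n".join(
        f"{label}: {'; '.join(facts)}" for label, facts in slots.items() if facts
    )
    # Fall back to the bare question when nothing is known about the asker.
    prompt = f"{context}\n\nQuestion: {query}" if context else f"Question: {query}"
    return generate(prompt)
```

Passing the lookup and the model call in as plain functions keeps the sketch testable with stubs and makes no assumption about any particular vendor SDK.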

[Figure: peripheral context signals (who's asking, their history, situational signals, and anomalies) orbiting a central query]

The Experiment: Isolating What Actually Matters

The experimental design started with the obvious controls. Account for token length. Rule out obvious confounders.

But the expanded direct condition came later. I had already run tests on four models across several conditions when I thought: what if these results could be achieved by simply including relevant context of any kind? How would I know the quality answers weren't just coming from connecting the LLM to a robust vector store? If I could show that the improvement came specifically from peripheral, asker-focused context, then I could be convinced PAE was a real innovation.

That's the six-condition design: baseline, flat peripheral context, slotted peripheral context, anomaly-flagged context, length-matched noise, and expanded direct. The noise condition rules out "just more tokens." The expanded direct condition -- domain-relevant but asker-agnostic filler text -- rules out "just more domain knowledge."
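To make the six conditions concrete, here is one way the context for each could be assembled from the same set of facts. This is a sketch under assumptions, not the harness used for the experiment: the function names, the "ATTENTION" marker, the lorem-ipsum noise vocabulary, and the use of whitespace word counts as a stand-in for token counts are all hypothetical.

```python
import random

# Hypothetical filler vocabulary for the noise condition.
NOISE_VOCAB = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]


def render_slots(slots, flag_label=None):
    """Render labeled slots; facts under `flag_label` get an ATTENTION prefix."""
    sections = []
    for label, facts in slots.items():
        prefix = "- ATTENTION: " if label == flag_label else "- "
        sections.append(f"[{label}]\n" + "\n".join(prefix + fact for fact in facts))
    return "\n\n".join(sections)


def fit_to_length(text, target_words):
    """Repeat or truncate `text` so it has exactly `target_words` words."""
    words = text.split()
    if not words:
        raise ValueError("filler text must contain at least one word")
    repeats = -(-target_words // len(words))  # ceiling division
    return " ".join((words * repeats)[:target_words])


def build_contexts(slots, anomaly_label, domain_filler, seed=0):
    """Return the context string for each of the six conditions."""
    rng = random.Random(seed)  # seeded so the noise condition is reproducible
    flat = " ".join(fact for facts in slots.values() for fact in facts)
    n_words = len(flat.split())  # crude proxy for token length
    return {
        "baseline": "",
        "flat": flat,
        "slotted": render_slots(slots),
        "anomaly_flagged": render_slots(slots, flag_label=anomaly_label),
        "noise": " ".join(rng.choice(NOISE_VOCAB) for _ in range(n_words)),
        "expanded_direct": fit_to_length(domain_filler, n_words),
    }
```

Both control conditions are matched to the length of the flat condition here, so any gain from the peripheral conditions cannot be put down to a longer prompt. A real harness would match on the target model's tokenizer rather than on word counts.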

I used three different model families -- DeepSeek, Gemma, and MiniMax -- with a fourth family (Gemini) as the judge. If PAE only works on the biggest frontier models, it's somewhat helpful. If it works regardless of the LLM, it's really helpful. And I wanted to avoid same-family bias in the judge.

The whole experiment cost under $10. That was important to me. I run a lot of tests and the costs add up fast. But more than that, I think it's meaningful when an advancement in the field can be tested and improved upon with a few bucks and a low-powered laptop. It lowers the barrier to entry, which is something I always aim for.

What the Results Actually Say

The headline: structured peripheral context improves response quality by roughly 2.5x. But a few findings surprised me.

I expected the legal advisory domain to show the smallest effect -- laws are laws, the domain seems rigid. Instead, it showed the largest gain of any domain. Asker-specific peripheral context being such a powerful lever on output quality in legal scenarios was genuinely unexpected.

The expanded direct condition confirmed what I suspected but needed to prove: adding more domain knowledge without entity-specific information did not improve the answers. Much of context engineering and RAG rests on the assumption that more context leads to richer outputs. It's a knee-jerk reaction in a nascent field. But this goes to show that, just as with regular prompting, quality trumps quantity.

The structure effect -- slotted context outperforming flat context -- was small by conventional standards (d = 0.13). But it was absolutely worth the engineering effort. Here's why: the flat condition already contained the same information. That structure added anything at all, when the information was already present, tells you it matters. And the Phase 2 result sealed it: the fine-tuned model performed best on the flat condition precisely because it was trained to pull structure out of unstructured data. Structure helps untuned models; training internalizes the structure. Either way, structure wins.
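For readers who want to reproduce an effect size like this on their own judge scores, here is a minimal sketch. It assumes the reported d is Cohen's d computed with a pooled standard deviation, which the write-up does not state, and the scores in the usage line are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(scores_a, scores_b):
    """Cohen's d for two independent samples, using the pooled sample SD."""
    n_a, n_b = len(scores_a), len(scores_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(scores_a) + (n_b - 1) * variance(scores_b)
    ) / (n_a + n_b - 2)
    return (mean(scores_a) - mean(scores_b)) / sqrt(pooled_var)


# Invented judge scores, not data from the experiment.
slotted_scores = [2, 4, 6]
flat_scores = [1, 3, 5]
print(cohens_d(slotted_scores, flat_scores))  # 0.5
```

Under Cohen's usual rule of thumb, 0.2 counts as a small effect, so 0.13 sits below even that threshold. That is the sense in which the structure effect is modest on paper.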

[Figure: bar chart comparing the six experimental conditions (baseline, noise, domain-only knowledge, flat context, slotted context, anomaly-flagged) alongside fine-tuned model performance]

Phase 2: Contextual Reach Is Trainable

Phase 2 was not always the plan. It emerged after seeing the power of Phase 1. The thinking was straightforward: we have easy access to powerful open-weight models now, and the concept here is simple -- just adjusting behavior -- so we should be able to train this.

The fine-tuned Gemma 4 12B model achieved 100% context-request and context-use rates on held-out scenarios. It outperformed the average of larger, untuned models on every condition. And its best performance was on the flat peripheral condition -- unstructured text that base models struggled with. It learned to impose its own structure internally.

That last point matters because it shows why structure is important. Without fine-tuning, the model would have kept struggling with unstructured data. The fine-tuned model handles unstructured data because it imposes its own structure, which reinforces why structure matters. Once we know that, we can improve model outputs across many different scenarios and domains.

I believe the industry is too focused on scaling up. Consumers are too enamored with big frontier models, so there isn't demand yet for small, niche-specific fine-tuned models. The space is simply too new. But I predict that in five or ten years, fine-tuned, skill-specific models will be the norm for businesses and for many individuals. Eventually, when people can pause for breath, they'll see the power of small models trained on specific cognitive skills.

The Practical Takeaway: Structure Your Context

If you're building an LLM-powered product and you want to adopt PAE today, the simplest first step is: structure your data. Build labels into slots.

[WHO'S ASKING]
- VIP tier, 6-year customer, $84K lifetime spend

[WHAT'S HAPPENING]
- Dec 28, peak holiday returns period

[WHAT'S UNUSUAL]
- Return rate: 47% (6x normal)
- Recent order: $12K bulk purchase

That's it. Put the entity history, situational signals, and anomaly flags in clearly labeled sections. Flag the surprising stuff with explicit attention markers. By structuring the input with proper labeling, outputs improve -- and the cost is near-zero.
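As a minimal sketch of the labeling step above, the slots can be rendered from a plain dictionary. This is not the pae.context.wrap() API, whose signature isn't shown here; `format_slotted_context` and `build_slotted_prompt` are hypothetical helpers, and the "Customer question:" wording is an assumption.

```python
from typing import Mapping, Sequence


def format_slotted_context(slots: Mapping[str, Sequence[str]]) -> str:
    """Render labeled slots as bracketed headers followed by bullet lines."""
    sections = []
    for label, items in slots.items():
        if not items:
            continue  # skip empty slots rather than emit a bare header
        lines = [f"[{label.upper()}]"] + [f"- {item}" for item in items]
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


def build_slotted_prompt(query: str, slots: Mapping[str, Sequence[str]]) -> str:
    """Put the slotted context ahead of the question, if there is any."""
    context = format_slotted_context(slots)
    question = f"Customer question: {query}"
    return f"{context}\n\n{question}" if context else question


example_slots = {
    "Who's asking": ["VIP tier, 6-year customer, $84K lifetime spend"],
    "What's happening": ["Dec 28, peak holiday returns period"],
    "What's unusual": [
        "Return rate: 47% (6x normal)",
        "Recent order: $12K bulk purchase",
    ],
}

print(build_slotted_prompt("What's your return policy?", example_slots))
```

Keeping the anomalies in their own slot, instead of mixing them into the history, is what lets you attach an explicit attention marker to them later without touching the rest of the context.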

The library behind pae.context.wrap() is deliberately thin -- zero dependencies, a single function call. That was a philosophical choice. Make it so simple that not adopting it is harder than adopting it.

What This Means Bigger Picture

Frontier models will continue to scale, probably until we get close to AGI. Engineering context isn't a shift away from scaling -- it's a complement to it, especially when dealing with smaller models. But it also levels the playing field. Not every business can afford frontier-model token costs, but small, well-engineered models can be a genuine advantage for smaller enterprises.

If contextual reach becomes standard -- every LLM-powered system automatically pulling peripheral context before answering -- a lot of the benefits will be invisible. Communication with chatbots will be more efficient, resolution times will drop, and hallucinations will decrease, even though most people can't recognize a hallucination in the first place. It's one of those things where the entire experience improves, but people won't realize what it was like before. They'll just raise the bar for their expectations.

The conviction I want people to walk away with is this: AI can be an effective tool for interacting with humans on a business level in a way that doesn't leave them unhappy with the experience. That sounds modest. It isn't. It's the thing most AI deployments in customer-facing roles are failing at right now -- and it's fixable.
