When this project began, the industry was shifting fast.
Robin was already a machine-learning-heavy company. We had invested years into proprietary models trained to identify clauses, labels, and structured legal concepts across contracts. AI was not new to us, but LLMs were.
The release of GPT-3 changed expectations almost overnight. Natural language interaction went from experimental to inevitable, and, like many companies operating in the AI space, we felt real pressure to demonstrate that we were not being left behind.
The challenge was not whether to use LLMs.
It was how to introduce them without compromising trust, accuracy, or privacy in a legal product.
The Problem We Were Actually Solving
Query was already powerful, but it felt technical.
Users had to:
- Memorise filters and labels
- Think in a SQL-like mental model
- Click through multiple steps to answer simple questions
At the same time, early LLM experiments made something very clear:
- Token limits were real
- Full-contract reasoning was unreliable
- Hallucinations were unacceptable in legal workflows
We were not just designing a feature. We were navigating a new class of constraints.
The Decision
Rather than shipping a chat-first or Copilot-style experience, I proposed a hybrid model:
- Users express intent in natural language.
- The LLM interprets that intent.
- Queries are executed through existing labels and filters.
- Results remain explainable, fast, and auditable.
This gave users the flexibility they wanted without asking the LLM to answer questions it could not reliably support.
It also meant we could modernise the experience without destabilising the product.
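To make the shape of this concrete, here is a minimal sketch of the pattern in Python. Everything in it is illustrative rather than Robin's actual implementation: the field names, the `interpret_intent` and `run_query` helpers, and the stubbed LLM call stand in for the real label schema, query engine, and model client.

```python
import json

# Hypothetical label schema: the kind of structured fields an existing ML
# pipeline already extracts from contracts. The LLM may only reference these.
ALLOWED_FIELDS = {
    "governing_law": "string",
    "renewal_term_months": "number",
    "has_indemnity_clause": "boolean",
}

PROMPT_TEMPLATE = (
    "Translate the user's question into a JSON filter.\n"
    "Use ONLY these fields: {fields}.\n"
    'Return JSON shaped like {{"field": ..., "operator": ..., "value": ...}}.\n'
    "Question: {question}"
)


def interpret_intent(question: str, call_llm) -> dict:
    """Ask the LLM to map natural language onto the existing label schema.

    `call_llm` is any callable taking a prompt string and returning the
    model's text; the real system would wrap its LLM client here.
    """
    prompt = PROMPT_TEMPLATE.format(fields=list(ALLOWED_FIELDS), question=question)
    query = json.loads(call_llm(prompt))

    # Guardrail: reject anything outside the deterministic schema, so results
    # stay explainable and auditable.
    if query["field"] not in ALLOWED_FIELDS:
        raise ValueError(f"Unknown field: {query['field']}")
    return query


def run_query(query: dict, contracts: list[dict]) -> list[dict]:
    """Execute the structured filter with the existing, non-LLM query engine."""
    ops = {"eq": lambda a, b: a == b, "gte": lambda a, b: a >= b}
    return [c for c in contracts if ops[query["operator"]](c[query["field"]], query["value"])]


if __name__ == "__main__":
    # Stubbed model response for illustration; in production this would be the
    # LLM interpreting the question below.
    fake_llm = lambda _prompt: '{"field": "renewal_term_months", "operator": "gte", "value": 12}'
    contracts = [
        {"id": "c1", "renewal_term_months": 24},
        {"id": "c2", "renewal_term_months": 6},
    ]
    query = interpret_intent("Which contracts renew for at least 12 months?", fake_llm)
    print(run_query(query, contracts))  # -> [{'id': 'c1', 'renewal_term_months': 24}]
```

The important property is that the model never generates an answer directly; it only produces a structured query that the deterministic system either accepts or rejects, which is what keeps the results fast, traceable, and free of invented content.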
Paths We Explicitly Did Not Take
We explored and intentionally rejected several alternatives.
Full contract ingestion
- Blocked by token limits
- High latency and cost
- Serious privacy and data isolation concerns
Chat-first or Copilot-style interface
- Impressive in demos
- Difficult to verify
- Misaligned with how legal professionals actually work
In a regulated environment, fluent answers without traceability are worse than no answers at all.
Validation and Real-World Constraints
Internally, the idea was validated quickly.
I presented the approach to our AI engineers and CTO, and within hours we had a working prototype that proved feasibility. The real challenge came after launch.
Some enterprise clients had extremely large clause and value datasets that still exceeded available context limits. Rather than shipping a degraded experience, we made the call to temporarily disable the feature for those accounts.
It was a conscious trade-off.
Consistency over partial coverage.
Trust over novelty.
As LLM context windows expanded, we were able to re-enable the feature universally without redesigning the system, which reinforced the long-term strength of the approach.
Outcome
- 65% faster query creation
- 92% user preference for natural language input
- Zero hallucinations when constrained to structured fields
More importantly, users reported higher confidence in results because they could see and refine how queries were constructed.
Broader Impact
This work did not just ship a feature.
It established a pattern for how Robin could responsibly adopt LLMs.
By integrating LLMs alongside an existing ML stack rather than replacing it, the business could:
- Roll out AI features without exposing users to early limitations
- Launch beta capabilities safely
- Build on familiar workflows instead of reinventing them
The same foundations later enabled:
- Contract summaries
- Clause explanations
- Clause library suggestions
- Translation and cross-language search
Features that previously felt risky or infeasible became achievable because the system was designed to evolve with the technology.
Reflection
This project reinforced a core belief of mine.
Good AI design is not about how impressive a model looks.
It is about how safely and clearly it fits into real workflows.
By grounding LLMs in deterministic systems, we delivered innovation without sacrificing trust and created a model that aged well as the AI landscape continued to change.
A Critical Realisation
I spent a lot of time following LLM releases and speaking with our engineers about how companies like Microsoft were approaching Copilot. One pattern stood out.
LLMs worked best when they did not operate alone.
Instead of forcing a language model to reason over massive bodies of text it barely understood, successful systems paired LLMs with deterministic, structured foundations.
Robin already had that foundation.
We had:
- A mature labeling system
- ML models trained on clause detection
- Structured metadata that represented contracts far more efficiently than raw text
The opportunity was not to replace our system with an LLM.
It was to let the LLM translate human intent into something our system already understood.
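As a rough illustration of why that foundation mattered (the record below is invented, not real Robin data): a contract reduced to its extracted labels is a few dozen tokens, while the raw document it represents would have overwhelmed early context windows.

```python
# Illustrative only: a contract as the structured labels an ML pipeline
# already extracts, rather than its raw text.
raw_contract_text = "THIS MASTER SERVICES AGREEMENT is made between ..."  # tens of thousands of tokens in practice

structured_record = {
    "id": "contract-4812",
    "counterparty": "Acme Ltd",
    "governing_law": "England and Wales",
    "renewal_term_months": 24,
    "has_indemnity_clause": True,
}

# The LLM never reasons over raw_contract_text. It only translates the user's
# question into references to fields like these, and the existing labeling
# and query systems do the rest.
```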