
Why Legal AI Needs Dual-Model Inference

Most legal AI platforms run a single AI model. That model is either large enough for complex reasoning (but slow and expensive per query) or small enough for fast responses (but limited in analytical depth). This is the fundamental tradeoff in single-model architectures, and it forces compromises on every query.

Scrivly takes a different approach: dual-model inference, where two models with different capabilities work together.

The Single-Model Problem

Large language models have a well-understood tradeoff between size and speed. A 70-billion parameter model can handle nuanced legal reasoning, complex multi-step analysis, and sophisticated document synthesis — but each query consumes significant compute resources and takes longer to generate.

A smaller model (8 billion parameters, for example) responds faster and uses less compute, making it efficient for straightforward tasks. But it lacks the depth for complex legal reasoning.

Single-model platforms choose one end of this spectrum and live with the consequences. A platform using only a large model is thorough but slow and expensive per query. A platform using only a small model is fast but shallow.
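The scale of this tradeoff can be estimated with a common rule of thumb: a dense transformer performs roughly 2 × N floating-point operations per generated token, where N is the parameter count. A minimal sketch (the 70B and 8B sizes come from the examples above; the 2 × N rule is an approximation, not a measured benchmark):

```python
# Back-of-envelope compute comparison. Rule of thumb: a dense
# transformer needs roughly 2 * N FLOPs per generated token,
# where N is the parameter count.
def flops_per_token(params: float) -> float:
    return 2 * params

large = flops_per_token(70e9)  # ~1.4e11 FLOPs per token
small = flops_per_token(8e9)   # ~1.6e10 FLOPs per token

ratio = large / small  # the 70B model costs ~8.75x more per token
```

By this estimate, every token the large model generates costs nearly nine times the compute of the small model, which is why serving every query with the large model is expensive.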

How Dual-Model Inference Works

Scrivly's architecture uses two models working together. The system intelligently routes each task to the appropriate model based on complexity. Simple retrieval tasks, document indexing, and straightforward extraction go to the smaller, faster model. Complex reasoning, multi-document synthesis, nuanced legal analysis, and document drafting engage the larger model.

This isn't two separate products. It's a single system that automatically determines which capability each task requires and routes accordingly. You ask a question; the system determines the complexity and uses the right tool.
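The routing logic described above can be sketched as a simple dispatcher. This is an illustrative sketch, not Scrivly's implementation: the task labels and model names are hypothetical, and a production system would classify incoming queries with an upstream classifier rather than a hand-written lookup.

```python
from enum import Enum

class Model(Enum):
    SMALL = "small-8b"   # fast tier: retrieval, indexing, extraction
    LARGE = "large-70b"  # capable tier: reasoning, synthesis, drafting

# Hypothetical task categories, mirroring the split described in the text.
SIMPLE_TASKS = {"retrieval", "indexing", "extraction"}
COMPLEX_TASKS = {"reasoning", "synthesis", "analysis", "drafting"}

def route(task_type: str) -> Model:
    """Pick a model tier based on the task's complexity category."""
    if task_type in SIMPLE_TASKS:
        return Model.SMALL
    if task_type in COMPLEX_TASKS:
        return Model.LARGE
    # Unknown tasks default to the larger model to preserve answer quality.
    return Model.LARGE
```

A design choice worth noting in the sketch: unrecognized tasks fall through to the large model, trading some compute for a guarantee that complex work is never accidentally sent to the shallow tier.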

Why This Matters for Legal Work

Legal work spans an enormous range of complexity within a single matter. A litigator might need simple fact extraction ("What date was the contract signed?") followed immediately by complex analysis ("How does the force majeure clause interact with the termination provisions given the timeline of events?").

A single-model system handles both queries with the same resources — either over-provisioning for simple tasks or under-provisioning for complex ones. Dual-model inference matches resources to requirements, delivering speed on simple tasks and depth on complex ones.

The economic implications are significant for on-premise deployment. Because Scrivly Local runs on fixed hardware, efficient resource allocation means more throughput from the same appliance. Simple tasks don't consume the compute budget that complex tasks need.
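A quick illustration of the throughput point, using entirely made-up numbers: assume 80% of queries are simple, 20% are complex, and a complex query costs 8.75× the compute of a simple one (the parameter-count ratio from earlier). The workload mix and costs are hypothetical assumptions, chosen only to show the shape of the math.

```python
# Hypothetical workload on a fixed-compute appliance.
# All numbers are illustrative assumptions, not measurements.
simple_share, complex_share = 0.8, 0.2   # assumed query mix
simple_cost, complex_cost = 1.0, 8.75    # relative compute per query

# Single large model: every query pays the full large-model price.
single_large = (simple_share + complex_share) * complex_cost

# Dual-model routing: only complex queries pay the large-model price.
dual = simple_share * simple_cost + complex_share * complex_cost

speedup = single_large / dual  # roughly 3.4x more queries per compute budget
```

Under these assumptions the same appliance serves roughly 3.4× the query volume, which is the "more throughput from the same appliance" claim made concrete.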

The User Experience

From the attorney's perspective, the dual-model architecture is invisible. You ask questions and get cited answers. The routing happens automatically. The difference you notice is that simple queries return almost instantly while complex analysis takes appropriately longer — rather than every query taking the same amount of time regardless of complexity.

Frequently Asked Questions

Do I choose which model handles my query? No. The system routes automatically based on task complexity. The experience is seamless.

Does dual-model inference mean lower accuracy? No. Each model operates within its capability range: tasks routed to the smaller model fall well within its competence, while complex tasks get the larger model's full analytical depth.

Is this the same as running two separate AI tools? No. It's an integrated architecture where both models share the same retrieval system, citation infrastructure, and document index. The routing is internal and automatic.

Can other legal AI platforms do this? Cloud-based platforms could theoretically implement multi-model routing, but most use a single model or model family. Scrivly's on-premise dual-model architecture is, to our knowledge, unique in legal AI.

