AutoResearch on Our Own Product

Draft, deploy-gated. The full narrative publishes with the Lab launch; the sections below are content-free by construction.

It is easy to claim a product improves over time. It is harder to show your work. AutoResearch runs measurable experiments against our own system, digest relevance scoring, knowledge-base retrieval thresholds, document-ingest dedupe, approval-card copy, and more, and publishes what it kept, what it discarded, and why.

Every experiment has the same anatomy

We hold each experiment to the same structure so the gallery is comparable, not anecdotal:

Objective metric, the single number the experiment is trying to move.
Baseline, what the metric was before any change.
Variants, the alternatives under test.
Resource band, a coarse, public-safe indication of what the experiment cost to run.
Decision, an explicit keep-or-discard call, recorded, not implied.

A change that does not beat its baseline gets discarded, and the discard is part of the public record. Negative results are results.

Running experiments against ourselves

The experiments target the parts of the product where quality is measurable:

Whether a retrieval threshold returns better chunks.
Whether a dedupe rule actually collapses duplicate forwarded documents.
Whether a relevance-scoring change improves digest ranking.
Whether a rewording of an approval card changes the approve, deny, or correct mix.

Pointing the experiment harness at our own surfaces keeps us honest: the same discipline we ask of any governed agent, baseline, variant, metric, decision, is the discipline we hold ourselves to.

What the gallery exposes

The experiment gallery is content-free by construction. It shows the metric, baseline, variants, coarse resource bands, and the keep-or-discard decision. It never publishes internal runtime details, raw prompts, or account mechanics. You see the shape of the experiment and its outcome, not the machinery underneath.

Browse the experiment gallery → /lab/experiments