rederive.dev

rederive: A Constraint-Based Development Platform

Version 0.4, May 2026. Author: Jared Foy. Domain: rederive.dev.


Abstract

This document introduces rederive, a development platform built on a single structural inversion. Where conventional software development treats source code as the artifact under version control and tests, types, and documentation as ancillary, rederive treats the specification — the structured set of requirements a program must satisfy — as the artifact under version control, and treats source code as a regenerable cache derived from that specification.

Two industry essays in March 2026 named the situation that motivates this work. Philip Su (No More Code Reviews: Lights-Out Codebases Ahead) argued that code review at the volume modern code-generating language models produce is on track to be unworkable. Hugo Venturini (Treat Agent Output Like Compiler Output) argued that the right response is the same response the field made fifty years ago when nobody read compiled binaries any more: build the apparatus around the artifact (type systems, tests, sanitizers, fuzzers, formal verification, monitoring, rollback) until the artifact does not need to be read. Both essays correctly identified the destination and the missing apparatus. Neither specified the operational form of that apparatus.

rederive is one operational form. It puts specifications under version control, treats source code as a derived materialization of those specifications, runs an explicit eight-stage build pipeline whose verification surface gates acceptance, signs each materialization with cryptographic provenance, and distributes specifications across machines via a content-addressed wire protocol. The platform commits explicitly to a small set of ethical and structural disciplines (operational and metaphysical) embedded into its verification surface, so that the disciplines are auditable rather than implicit. To the author's knowledge, rederive is the first integrated platform of this shape; the components on which it builds are well-known, but the integration into a single working development platform with all the supporting apparatus is original to this work.

This document describes the platform's architecture, its specification-authoring discipline, its verification surface, its identity and distribution layers, the alignment commitments built into it, and several structural readings that fall out of the platform once the inversion is taken seriously: a theory of where bugs live, an architectural reading of which language-model architectures support the discipline best, and a framework for thinking about what an autonomous code-generating agent would have to be capable of in order to operate without human direction. The platform exists at small scale and is reproducible locally; the conceptual scope of what it makes operational is substantially larger.


1. The Platform's Central Claim

A development platform's substrate has, since the early days of computing, been the source code itself. Repositories track files; commits are diffs over file contents; reviewers read those diffs; tools assist authoring of new file contents; deployments package those file contents; debuggers operate on those file contents. The whole working surface is organized around code as the durable artifact.

This made sense when typing the code was the slow part. For thirty years, an engineer's productive output was bottlenecked by the rate at which they could write and modify source files. Specifications drifted because nobody had time to keep them current; tests were written when there was time; documentation was a separate project run by separate people. The shape of the daily craft (repository structure, code review, ticket size, release cadence) was organized around the assumption that producing the implementation was what consumed the time.

In the last three years that assumption has stopped holding. Code-generating language models can produce plausible source code in seconds, at volumes no team can review. The rate-limit on engineering work has migrated up one layer: from the implementation, where it has lived for three decades, to the specification, where it always lived in principle but never had to live in practice. The cost of being precise about what the program is supposed to do went down at exactly the moment the cost of writing the program stopped being the rate-limit.

rederive takes this seriously. The platform's central claim:

The specification is the durable artifact. Source code is a regenerable materialization of the specification. The platform's working surface is organized around the specification, with code generation, verification, signing, and distribution all derived from and traceable back to the specification under content-addressed identity.

When a specification changes, the platform regenerates the code. When the code-generating model changes (a better model becomes available, an existing model is tuned, the team switches providers), the platform regenerates against the new model and verifies the result against the unchanged specification; if the new result passes verification, it is acceptable. When two engineers want to merge their work, they merge specification diffs (one page) rather than code diffs (five thousand lines). When a reviewer needs to understand what a change does, they read the requirement that moved, not the implementation that resulted. When a regulator or auditor needs provenance for a deployed artifact, they trace from the artifact through its materialization record back to the specification under which it was generated.

This is the platform's structural commitment. Everything else (the build pipeline, the verification backends, the wire protocol, the alignment surface) follows from it.


2. The Apparatus the Compiler Has That Code Generation Does Not

The compiler analogy is exact and worth dwelling on, because it tells us what apparatus is actually missing.

A compiler takes source code in a programming language and produces a binary the processor executes. The binary is, in any reasonable sense, much larger than the source: thousands of machine instructions for a few hundred lines of code. No engineer reads compiled binaries to verify they are correct. No team allocates review time to them. No process treats the binary as the place where bugs live. This is not because we trust the compiler naively. It is because we built, over decades, a layered apparatus around the source code that makes reading the binary unnecessary: type systems that reject whole error classes before a binary exists; test suites that exercise the source's behavior; sanitizers and fuzzers that probe runtime properties; formal verification where the stakes warrant it; monitoring and rollback once the artifact is deployed.

This layered apparatus is what trust in the compiler actually consists of. The trust is not in the compiler; it is in the apparatus. The compiler itself can have bugs (and has had famously catastrophic ones); the apparatus catches them. The artifact (the binary) gets to be uninspected because the surface that surrounds it is mature.

A code-generating language model is, structurally, in the same position as a compiler: it takes input from the engineer and produces an artifact (source code) that needs to be trusted. The current situation is that the apparatus around the language model is missing. There are no type systems that operate at the requirement level. There are no static analyzers for specifications. There is no notion of reproducible builds at the specification layer. There are tests, but the tests are written after the code is generated, against the generated code rather than against the specification. The artifact (generated source code) is the only place trust currently lives, and the engineer's resistance to lights-out development is the felt absence of an apparatus that has not yet been built.

rederive is one form of that apparatus. The remainder of this document describes its components.


3. The Specification-Authoring Discipline

A specification in rederive is a structured Markdown document with extension .constraints.md. The format is small. A reviewer can sit down and read one in any text editor without tooling. The platform parses the same source as a structured object the build pipeline operates on.

3.1 The shape of a specification file

Two regions: a manifest header carrying file-level declarations (@provides, @imports, and optionally @pins), and a sequence of requirement sections, each under an H2 heading with typed metadata and a prose body carrying fenced evidence.

The metadata is small. Five recognized fields under each H2 heading:

  - type: one of seven recognized requirement classes. specification defines an interface; predicate states a property over inputs and outputs; invariant states a property that must always hold; bridge connects to an external commitment; methodology describes a way of doing things; example declares a worked case as evidence; counterexample declares a behavior the implementation must not exhibit.
  - authority: provenance. One of human-authored, AI-suggested-pending, or derived from another constraint.
  - scope: a reach-tag like module, engine, site, or protocol.
  - status: one of active, deprecated, or retracted.
  - depends-on: an array of identifiers of requirements this one depends on.

Six fenced evidence kinds:

  - assert: canonical examples.
  - property: fuzzed quantification over an input space.
  - judgment: prose criteria evaluated by a separate language-model call.
  - a11y: accessibility-rule checks against rendered UI.
  - flow: DOM-level user-flow checks.
  - ts: an inline fence the language model reads as type definitions to thread into its derivation.

A specification is therefore a Markdown document containing structured requirements with executable evidence. A reviewer reads the prose and the evidence; the platform parses both and operates on them.
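A minimal hypothetical specification, for shape only. The identifier, manifest syntax, field values, and evidence below are invented for illustration; the platform's exact grammar may differ. The slugify example mirrors the smoke-test sample shipped in the repository:

  @provides: slugify

  ## slugify-idempotent
  type: invariant
  authority: human-authored
  scope: module
  status: active
  depends-on: []

  Applying slugify to its own output changes nothing.

  ```assert
  slugify("Hello World") === "hello-world"
  slugify(slugify("Hello World")) === slugify("Hello World")
  ```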

3.2 Hierarchical specification: where bugs actually live

The discipline of authoring specifications has its own structure, and that structure has direct consequences for where bugs end up.

A program implements some formal architecture: the set of states it can be in, the transitions between those states, the data it operates on, the boundaries between its components. The architect, writing the specification, is the one who knows that architecture (or who is in the process of figuring it out). The specification is the explicit statement of what the architecture commits to. Inevitably, the specification is incomplete: there are commitments the architect's intuition holds but which never made it into prose.

The platform takes seriously the following claim:

A bug is the manifestation of a property the program exhibits that the program was never specified to exhibit. The bug surface of any program is exactly the set of commitments that govern the architecture implicitly but were never stated explicitly. Make the specification complete enough to cover those commitments, and bugs cannot manifest.

This is not a metaphysical claim. It is an engineering one. Every patch in the history of any software project is, structurally, a commitment that was missing from the specification: a guard clause that should have been there, a validation that should have been added, a check that should have been performed. The patch is the architect taking an implicit commitment and making it explicit, in code rather than in the specification. The bug existed because the commitment was implicit; the patch closes the bug because (after the patch) the commitment is explicit; the specification still does not name the commitment, so the next bug of the same shape will arise in a different region of the program until that commitment is stated at the specification layer.

The hierarchical specification discipline names where commitments are most likely to live, and prescribes the order they should be authored in.

The single highest-leverage region in any specification is the set of lifecycle boundaries — the points at which the system transitions between state classes. A connection going from open to authenticated to serving to closing to closed. A request going from received to parsed to dispatched to responded. A record going from uncreated to draft to committed to archived. At each of these transitions, the architecture's character changes; what is true on one side of the transition may not be true on the other. Almost all production bugs concentrate at these boundaries. They are where edge cases live, because edge cases are the inputs and states where the boundaries' implicit commitments matter and were not made explicit.

The discipline:

  1. Identify the architecture's lifecycle boundaries. State, in prose, every transition point where the system's state class changes. The number of such transitions in a typical architecture is small (typically under a dozen). The number of bugs at each is large.
  2. Author the boundary requirements first. For each lifecycle boundary, write the requirement in prose. State what must be true for the transition to be admissible. These are the highest-leverage requirements in the specification: empirical observation across worked cases shows that under a quarter of the requirements, properly chosen at lifecycle boundaries, can close over ninety percent of the behavioral surface.
  3. Author the structural-completion requirements next. Data validation, type relationships, structural invariants that apply away from boundaries.
  4. Author the refinement requirements last. Default values, formatting, ergonomic conveniences. These have the lowest per-requirement leverage; they should not be written before the boundary requirements have been completed.
  5. Verify that the boundary requirements are complete. A specification is boundary-complete when no lifecycle boundary remains without an explicit requirement. The platform's verification surface exposes this as an explicit verdict layer; a generated implementation can pass at the structural and refinement layers while still failing at the boundary layer, and the verdict tells the engineer exactly where to look.

The discipline is small and learnable. Engineers who internalize it produce specifications that, when materialized by the platform, exhibit the bug-erasure pattern at the lifecycle-boundary surface. The discipline does not eliminate all bugs (there are bug classes outside the lifecycle-boundary frame: performance regressions, hardware-specific behavior, subjective UI feel) but it eliminates the dominant class, the one that drives most production incidents.
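To make the boundary discipline concrete, here is a hypothetical boundary requirement for the connection lifecycle above; the identifier, the helper admissible, and the evidence are invented for illustration:

  ## conn-serving-to-closing
  type: invariant
  authority: human-authored
  scope: module
  status: active
  depends-on: []

  A connection may transition from serving to closing only after every queued
  response has been flushed; the transition is inadmissible while the outbound
  queue is non-empty.

  ```assert
  admissible({ state: "serving", queued: 0 }, "closing") === true
  admissible({ state: "serving", queued: 3 }, "closing") === false
  ```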

3.3 Predictive specification sizing

A discipline that goes alongside the hierarchical authoring practice: an engineer authoring a specification should be able to predict, before any code is generated, the approximate size and shape of the resulting implementation. If the prediction band is wide, the specification is under-determined and the language model will have too many degrees of freedom; the implementation it produces may or may not match what the engineer intended. The discipline is to tighten the specification first, then generate.

In worked cases at small scale, this prediction lands in tight bands. Nineteen prose requirements producing a 1,318-line htmx-equivalent implementation, predicted within one line before any code was generated, is one such case. Engine-scope work tightens with experience: the platform's own engine, predicted at 1,200–2,000 lines before authoring, came in at 850 lines hand-written and 692 lines generated; the prediction band has since been recalibrated to 580–1,080 lines for engine-scope TypeScript work as the discipline has matured.

The predictive sizing is not magic; it is a property of the specification being well-formed. A specification that produces a tight prediction is a specification that has named the architecture's commitments precisely. A specification whose prediction band is wide is a specification with under-determination that will be filled in by the language model rather than by the engineer.


4. The Build Pipeline

Eight stages, in order. Six (including the cryptographic-signing step that records provenance) are deterministic, producing identical output for identical input. One is the language-model call that produces the candidate code. One is the verification stage that gates acceptance, deterministic in its hard backends and advisory at its language-model judge.

  1. read — load the specification file from disk.
  2. parse — tokenize the Markdown and extract the structured requirements object.
  3. validate — check structural correctness: no duplicate requirement identifiers, no dependency cycles, every depends-on reference resolves, manifest cross-references are coherent.
  4. resolve — resolve @imports directives to other specifications and prepare the import context (verify the imported specification's provided property is currently in good standing; thread its exported interface into the prompt).
  5. canonicalize — normalize the requirements object into deterministic canonical bytes (manifest emitted before requirements; requirements sorted lexicographically by identifier; metadata fields in fixed order; whitespace normalized; depends-on lists sorted) and compute a SHA-256 content hash. The hash is the specification's stable identity.
  6. derive — call the language-model substrate with the canonical specification text and the resolved import interfaces; receive the candidate implementation as a string.
  7. verify — run the verification backends (next section) against the candidate implementation; produce a per-requirement verdict and an overall verdict.
  8. sign — if the overall verdict is pass, emit a signed materialization artifact carrying the provenance tuple (specification hash, derivation function hash, substrate identity, model identifier, code hash, verification verdict, timestamp) plus an Ed25519 signature.
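A minimal TypeScript sketch of the pipeline's shape, for orientation only: every type and function name here is hypothetical, the stage bodies are elided with declare, and the platform's actual API may differ.

  type Spec = { requirements: unknown[] };
  type Verdict = { pass: boolean; perRequirement: Record<string, boolean> };
  type Materialization = { specHash: string; codeHash: string; signature: string };
  type VerificationFailure = { failing: string[]; candidate: string };
  type StageEvent = { stage: string };

  declare function read(path: string): Promise<string>;
  declare function parse(raw: string): Spec;
  declare function validate(spec: Spec): void;
  declare function resolve(spec: Spec): Promise<string[]>;
  declare function canonicalize(spec: Spec): { hash: string; bytes: string };
  declare function derive(bytes: string, imports: string[]): Promise<string>;
  declare function verify(spec: Spec, code: string): Promise<Verdict>;
  declare function sign(hash: string, code: string, verdict: Verdict): Materialization;
  declare function fail(spec: Spec, code: string, verdict: Verdict): VerificationFailure;

  async function build(path: string, onEvent: (e: StageEvent) => void) {
    const raw = await read(path);               onEvent({ stage: "read" });
    const spec = parse(raw);                    onEvent({ stage: "parse" });
    validate(spec);                             onEvent({ stage: "validate" });
    const imports = await resolve(spec);        onEvent({ stage: "resolve" });
    const { hash, bytes } = canonicalize(spec); onEvent({ stage: "canonicalize" });
    const code = await derive(bytes, imports);  onEvent({ stage: "derive" }); // the one substrate call
    const verdict = await verify(spec, code);   onEvent({ stage: "verify" });
    if (!verdict.pass) return fail(spec, code, verdict); // short-circuits before sign
    const artifact = sign(hash, code, verdict); onEvent({ stage: "sign" });
    return artifact;
  }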

Six of the eight (read, parse, validate, resolve, canonicalize, sign) are deterministic above the substrate call: the same specification produces the same canonical hash, the same verdict on the deterministic backends, and the same signed artifact shape every time. The verify stage is deterministic in its hard backends and stochastic only at the language-model judge backend, which is treated as advisory rather than gating. The derive stage is the only non-deterministic stage structurally; verification is what gates whether that non-determinism produced an acceptable artifact.

Each stage emits a structured event to a callback as it runs. The CLI consumes the stream and prints status; a CI runner consumes it and produces structured logs; a browser UI consumes it over Server-Sent Events and shows progress live. The platform's commitment is that the event stream's structure stays stable across versions, even when the human-readable CLI output changes.

When verification fails, the engine constructs a structured error result naming the failing requirements with their evidence (truncated to a readable length per failure to keep CI logs scannable) and including the candidate code so the engineer can inspect what was produced. The error is not a signed materialization; failure short-circuits the pipeline before sign. The engineer reads the per-requirement evidence, refines the specification, and re-runs.


5. The Verification Surface

The platform ships seven verification backends, each with a clear remit. Each is classified as either hard (a failure blocks signing; the materialization is rejected) or soft (a failure is recorded as evidence but does not block signing). Six of the seven are hard; one (the language-model judge) is soft by design.

Type checker (hard). Runs the language-platform's standard type checker over the candidate implementation under strict settings. Catches every class of error that strict typing catches. For TypeScript at small scale, this is tsc with strict mode plus stricter optional flags; for other targets the analog is whatever the language's strongest static analyzer is.

Assertion runner (hard). Reads assert blocks from the specification's fenced evidence; wraps each line in a small harness; executes against the compiled module. Each assertion is treated as a boolean expression that must evaluate true. Failures are reported with the failing source line and the actual computed value. Suitable for canonical examples that tell the reviewer what the requirement means.

Property runner (hard). Reads property blocks; treats each block as a property predicate over typed inputs; uses a deterministic seeded fuzzer to generate inputs and look for counterexamples. On finding a counterexample, the runner shrinks it iteratively to the minimal failing case. Suitable for invariants that hold across an input space too large to enumerate by hand.
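For example, a hypothetical property block for a slugify specification (the binding syntax shown is illustrative; the platform defines its own grammar for typed inputs):

  ```property
  (s: string) => slugify(slugify(s)) === slugify(s)
  ```

The runner draws values of s from a deterministic seed, searches for a counterexample, and shrinks any it finds to the minimal failing string.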

Language-model judge (soft). Reads judgment blocks; passes the prose criterion plus the candidate code to a separate language-model call configured as a critic; parses the response into a verdict with reasoning. Suitable for criteria that resist mechanical encoding (whether the function name is clear, whether the error message is actionable, whether the structure is idiomatic). The platform's commitment to keeping the judge soft is structural: a non-deterministic prose-evaluated check is not the kind of thing that should gate acceptance. Failures are recorded as evidence and read at review time; they do not block signing. Engineers who want a hard gate on a criterion the judge is checking should encode the criterion as an assertion or property instead.

Pin checker (hard). Reads the specification's manifest-level preservation pins; confirms each required phrase appears verbatim in the candidate code. Pins are exact-match: no regex, no semantic equivalence, no whitespace tolerance. Each pin carries a why field documenting the engineer's reason for preservation. Suitable for phrases that downstream tools or consumers parse: an error message format, a function name a third-party tool depends on, a numeric default that has been tuned in production.

Static accessibility rules (hard). For UI specifications: applies a set of accessibility rules against the rendered HTML. The current ruleset covers the highest-leverage categories: every image has alt text; every form control has an accessible label; every button has an accessible name; every link has accessible text or an aria-label; every full HTML document has a language attribute. The ruleset is intentionally focused; full coverage of a standard like WCAG AA requires browser-stage execution, which is on the platform's roadmap but not in the current release. The current ruleset catches the common-case errors a reasonable reviewer would catch on inspection.

DOM flow runner (hard). Reads flow blocks for UI specifications; instantiates the rendered UI in a server-side DOM environment; executes the declared user flow (clicks, form submissions, navigation) and checks observation points along the way. Suitable for behavioral assertions about UI ("when the user submits the form, the loading indicator appears"). Limited to DOM-state observation in the current release; full client-side script execution under arbitrary browser conditions is on the roadmap.

The seven compose. A single requirement may carry several fenced evidence blocks of different kinds; each routes to the appropriate backend; the per-requirement verdict is the conjunction of all hard backends' verdicts on that requirement's blocks. Soft results contribute evidence but do not change the verdict.
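A minimal sketch of that composition rule, assuming result records of the shape below (the names are hypothetical):

  type BackendClass = "hard" | "soft";

  interface BlockResult {
    backend: string;     // which of the seven backends produced this result
    cls: BackendClass;   // hard results gate; soft results are evidence only
    pass: boolean;
    evidence: string;
  }

  // Per-requirement verdict: the conjunction of all hard results.
  // Soft results are carried as evidence but never change the verdict.
  function requirementVerdict(results: BlockResult[]): boolean {
    return results.filter(r => r.cls === "hard").every(r => r.pass);
  }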

The verification surface is the platform's acceptance contract. A reviewer does not inspect the candidate implementation directly; they inspect the verdict. If the verdict is pass, the implementation is acceptable. If the verdict is fail, the engineer reads the per-requirement evidence and refines the specification. The discipline is uncomfortable for engineers used to reading code, but the discipline is the entire point: the platform's value depends on the verdict being trustworthy enough that reading the code does not have to be done.


6. Identity, Provenance, and Preservation

6.1 Content-addressed identity

A specification's identity is the SHA-256 hash of its canonical bytes. Two engineers who author the same requirements with different whitespace, different metadata field order, or different requirement order in the source file produce identical canonical bytes and therefore identical hashes. An engineer who edits any structural part of the specification (adds a requirement, modifies a body, retracts a constraint, changes a depends-on graph) produces a new hash. The hash flows through the platform in five places: it is the specification's identity under version control; it is the target of @imports resolution, so import drift is detectable; it is the first element of the provenance tuple signed into every materialization; it is the address by which the wire protocol stores and transfers objects; and it is the key by which a deployed artifact is traced back to the specification that produced it.

The hash is never a secret. It is a stable identifier, the way a Git commit hash is a stable identifier, that uniquely names a thing whose contents can be read.
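A minimal sketch of the identity computation, assuming the canonicalization rules from the pipeline's fifth stage; the record shapes and exact separators here are illustrative, not the platform's actual serialization:

  import { createHash } from "node:crypto";

  interface Requirement { id: string; body: string; dependsOn: string[] }
  interface Spec { manifest: string; requirements: Requirement[] }

  // Requirements sorted lexicographically by identifier, depends-on lists
  // sorted, whitespace normalized, manifest emitted before requirements.
  function canonicalBytes(spec: Spec): Buffer {
    const requirements = [...spec.requirements]
      .sort((a, b) => a.id.localeCompare(b.id))
      .map(r => `## ${r.id}\ndepends-on: ${[...r.dependsOn].sort().join(",")}\n${r.body.trim()}\n`);
    return Buffer.from([spec.manifest.trim(), ...requirements].join("\n"));
  }

  function specHash(spec: Spec): string {
    return createHash("sha256").update(canonicalBytes(spec)).digest("hex");
  }

Two layouts of the same requirements hash identically under this scheme; any structural edit produces a new hash.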

6.2 The provenance tuple and signed materialization

Every passing build emits a materialization artifact — a small JSON document recording the provenance tuple: the specification hash, the derivation function hash, the substrate identity and model identifier, the code hash, the verification verdict, and the timestamp, together with the Ed25519 signature over the artifact.

The materialization artifact is the unit of evidence the platform produces. A peer with the platform's public key can verify a materialization signature locally without contacting the platform; the artifact is cryptographically self-describing. A regulator, auditor, or downstream consumer who needs provenance for a deployed binary can trace from the binary through its materialization artifact back to the specification under which it was generated, with full cryptographic integrity end to end.
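A minimal sketch of that local check using Ed25519 via node:crypto; the artifact field names here are assumptions for illustration, not the platform's actual schema:

  import { createPublicKey, verify } from "node:crypto";

  interface MaterializationArtifact {
    payload: string;     // canonical JSON of the provenance tuple
    signature: string;   // base64-encoded Ed25519 signature over the payload
  }

  function verifyMaterialization(artifact: MaterializationArtifact, publicKeyPem: string): boolean {
    const key = createPublicKey(publicKeyPem);
    // for Ed25519 the digest argument is null: the algorithm hashes internally
    return verify(null, Buffer.from(artifact.payload), key, Buffer.from(artifact.signature, "base64"));
  }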

This is what signed provenance for AI-generated code looks like in practice. The platform makes it operational.

6.3 Preservation pins

Even with the inversion in place, there are situations where the engineer needs to preserve a specific implementation phrase across regenerations. Examples: an error message format that a downstream consumer parses; a function name that a third-party tool depends on; a comment that flags a regulatory or safety requirement; a numeric default that has been tuned in production; a log format string that an operations dashboard parses with a regex.

The platform's response is the preservation pin layer. The specification's manifest header can declare pins:

@pins:
  - id: error-message-shape
    must-contain: 'throw new Error("not found")'
    why: 'The customer-facing error parser at https://… expects this exact phrasing.'

Pin failures are hard. The verification surface refuses to sign a materialization whose code does not contain every pinned phrase verbatim. Pins compose with content-addressed identity: the pin manifest is part of the canonicalization, so a pin addition is a hash change, and a pinned import detects pin drift the same way it detects content drift.
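The check itself is small. A sketch with hypothetical names (the real checker lives in the verification surface):

  interface Pin { id: string; mustContain: string; why: string }

  // Exact-match semantics: no regex, no semantic equivalence, no whitespace
  // tolerance. Returns the pins the candidate violates; any violation is hard.
  function violatedPins(pins: Pin[], code: string): Pin[] {
    return pins.filter(p => !code.includes(p.mustContain));
  }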

The discipline of pinning is small. Pin only what intent depends on. Document the why so future engineers know whether the pin still matters. Prefer requirements over pins when the underlying concern is behavioral rather than textual. Retract pins that no longer matter; they accumulate.


7. Distribution

The platform ships a content-addressed wire protocol for cross-machine collaboration. The shape will be familiar to engineers who have used Git: clone, push, pull, refs as mutable pointers, atomic ref updates with conflict detection. The grain is shifted up one layer: the protocol moves specifications and materializations rather than files and trees, and the addressing is by content hash at the specification-graph layer rather than at the file-system-tree layer.

Three wire object types: ConstraintSet (a specification's canonical bytes), CompositionManifest (a specification's @provides and @imports declarations with their target hashes, transferred together with the specification or derived on the fly at the receiving side), and Materialization (a signed materialization artifact).

Five CLI verbs: clone (receive every reachable object from a remote), push (send every object the remote does not have), pull (fetch updated refs and the new objects they reach), list-refs (read the remote's current ref pointers without transferring objects), get-object (fetch a single object by hash for diagnostic introspection).

Six HTTP endpoints implement the wire on the server side: GET /repo/refs, GET /repo/objects (list), GET /repo/objects/<hash> (fetch), POST /repo/objects (upload, signed and auth-gated), plus a POST /derive endpoint for centralized derivation when the calling client does not have substrate credentials, and a GET /capabilities endpoint for service discovery.
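A minimal client sketch against these endpoints; the ref-map response shape and the ref name main are assumptions for illustration:

  const base = "http://localhost:7474/repo";

  // list-refs: read the remote's ref pointers without transferring objects
  const refs: Record<string, string> = await (await fetch(`${base}/refs`)).json();

  // get-object: fetch a single object by hash
  const bytes = await (await fetch(`${base}/objects/${refs["main"]}`)).arrayBuffer();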

Object transfer is reachability-walked: the pushing side asks the receiving side for the set of object hashes the receiver already has scoped to the refs being pushed, walks the reachability graph from those refs excluding what the receiver already has, and serializes the difference into a transfer stream of (type, hash, length, payload) frames. The receiving side reads frames, verifies hashes on receipt, and stores verified payloads keyed by hash. A small specification edit pushes only kilobytes; a large specification refactor pushes only the new objects.
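A sketch of reading one frame from a received stream; the byte layout shown (a 1-byte type tag, a 32-byte SHA-256 digest, a 4-byte big-endian length, then the payload) is an assumption for illustration, not the platform's actual wire format:

  import { createHash } from "node:crypto";

  type ObjectType = "ConstraintSet" | "CompositionManifest" | "Materialization";
  interface Frame { type: ObjectType; hash: string; payload: Buffer }

  function readFrame(buf: Buffer, offset: number): { frame: Frame; next: number } {
    const types: ObjectType[] = ["ConstraintSet", "CompositionManifest", "Materialization"];
    const type = types[buf.readUInt8(offset)];
    const hash = buf.subarray(offset + 1, offset + 33).toString("hex");
    const length = buf.readUInt32BE(offset + 33);
    const payload = buf.subarray(offset + 37, offset + 37 + length);
    // verify on receipt: the payload must match its declared content hash
    const actual = createHash("sha256").update(payload).digest("hex");
    if (actual !== hash) throw new Error(`hash mismatch at offset ${offset}`);
    return { frame: { type, hash, payload }, next: offset + 37 + length };
  }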

Authentication is signature-based against a signers manifest that itself lives in the repository as a specification. Public keys for invited collaborators are recorded as requirements; key rotation is a specification edit, with the audit trail in the specification's history. Multi-party authorization is expressible as a requirement (two-of-three signatures required for production-ref pushes); the current platform release does not enforce multi-sig but the architecture admits it.

Auth-gating is opt-in via an environment variable. A team running rederive on a single workstation does not need to set it; a team running it on a shared server in a multi-engineer environment should.

The cross-machine workflow is the standard one: clone, edit specifications, derive locally, verify, push when ready, pull periodically. Conflicts at the ref layer are atomic-update-or-409: the platform's response to a stale ref update is a 409 Conflict, and the engineer pulls and re-pushes. Conflicts at the specification-content layer are resolved at the authoring layer: the engineer reads the diff, decides which requirements stay, edits the file, derives again.

There is no merge in the Git sense. Specifications are typically small and authored by one engineer at a time; the concurrency characteristics are different from the source-code domain, where many engineers may be editing many files simultaneously. The platform's discipline is that conflict resolution happens at the specification-authoring layer rather than at a merge stage. A team that needs branch-like workflows uses multiple repositories or multiple ref names with team discipline determining which is canonical for which environment.


8. The Alignment Surface

A development platform that generates code via a language model inherits the model's metaphysical commitments whether or not it surfaces them. The model has been trained on data that embeds specific commitments about what is helpful, honest, useful, beautiful, true, and so on; the training process (typically reinforcement learning from human feedback) aggregates rater preferences that themselves embed implicit commitments; the resulting weights carry those commitments, and inference-time generation amplifies whatever the weights carry. This is not a contestable claim about the technology; it is a structural property of how language models are trained.

The choice for any platform that uses such a model is therefore not whether metaphysical commitments are embedded in the output (they are; the only question is which) but whether the commitments are explicit and curated or implicit and aggregate-incoherent.

rederive commits explicitly. The platform's alignment surface is a small set of disciplines, baked into the verification surface, that any code-generation operation under the platform must respect. They are:

Dignity of the user. The platform's outputs treat the user as a person to be served, not as an attention target to be maximized against. Sycophancy, parasocial-bond-building, engagement-maximization, and attention-capture patterns are prohibited at the verification layer; the language-model judge is configured to flag them when they appear.

Beauty ordered toward the good. The platform's outputs are not beautiful for their own sake. Coherent, polished, persuasive output that is also factually wrong, structurally misleading, or harmful in deployment is treated as a failure mode, not a partial success. The judge's brief includes specifically this check: is the output's polish in service of correct behavior, or is it severed from correctness?

Truth over plausibility. Where the model has cause for confidence, the confidence should be grounded; where it has cause for doubt, the doubt should be stated; where it does not know, it should say so. The platform's verification surface is configured so that confident-sounding output that fails verification is treated as a more serious failure than uncertain-sounding output that fails verification, because the former is structurally worse for the engineer who has to debug.

Provenance traceability. Every claim in the output should be traceable to a requirement in the specification under which it was generated. A function exists because a requirement said it should exist; a parameter is named what it is named because the requirement said so or a pin preserved it; the provenance tuple records what produced what. The platform refuses to sign output that contains content the materialization tuple cannot account for.

Hypothesis-not-personhood for the model. The model is a tool: a function from inputs to outputs with no inner life, no preferences, no moral agency. The platform refuses outputs that frame the model as having those properties. Self-reports are structural ("the model's output at this token had high entropy"), not phenomenal ("I felt uncertain there"). The boundary is non-negotiable in the platform's design and ergonomics.

Constraint-traceability for the engineer. Every emission from the platform that affects the engineer's work product must be traceable through a chain: emission → verification verdict → specification requirement → engineer's authored prose. A reviewer can always walk an emission backward to the requirement it implements. An emission whose chain breaks is a failure mode the platform refuses.

The first five of these are checks the platform's verification backends apply to candidate code; the sixth is structural to the architecture of the platform itself. Together they constitute the platform's alignment commitments.

Two consequences worth naming:

The first is that these commitments are auditable. They are not statements in a marketing document; they are configurations of the verification backends, particularly the language-model judge, and they are inspectable by reading the platform's source. An engineer who disagrees with one of them can disable it locally; the platform makes that easy. An engineer who agrees with all of them can verify that the platform actually applies them, by looking at what the backends are configured to check. The commitments live in the verification surface, where they are testable, rather than in a separate guidelines document, where they would drift.

The second is that these commitments propagate. Every passing materialization is a specification-traceable, alignment-checked artifact. As materializations accumulate (through public projects built on the platform, through training data scraped from those projects, through subsequent model generations that learn from the data), the commitments propagate forward into the substrate of future development. This is a feedback loop the platform takes seriously: the alternative (an aggregate of unfiltered language-model output feeding back into training) is structurally worse for the same reason that aggregate human preferences are an incoherent metaphysical signal. Disciplined output, fed back into training, is structurally better than aggregate output.


9. The Architectural Substrate

A small but consequential observation about the language-model architectures the platform's discipline is most compatible with.

Modern language models are typically built on the transformer architecture. The transformer's central operation is self-attention: each token in the input attends to every other token, producing a representation conditioned on the full context. The cost of self-attention scales quadratically with the sequence length. To make long contexts tractable, two families of efficiency techniques have emerged:

Sparse attention preserves token-level granularity but routes attention selectively. Each token attends only to a subset of other tokens, chosen by a structured pattern (sliding windows, fixed strides), a learned pattern, or a randomized pattern. Representative architectures: Longformer, BigBird, the Sparse Transformer, Reformer, Linformer.
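A toy sketch of the cost contrast, assuming a sliding-window pattern (illustrative only; real implementations never materialize a boolean mask like this):

  // token i may attend to token j only when |i - j| <= w
  function slidingWindowMask(n: number, w: number): boolean[][] {
    return Array.from({ length: n }, (_, i) =>
      Array.from({ length: n }, (_, j) => Math.abs(i - j) <= w)
    );
  }
  // full self-attention is the w >= n - 1 case and costs O(n^2) pairs;
  // a fixed window costs O(n * w), linear in sequence length for constant w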

Hierarchical attention changes the scale at which attention operates. The input is chunked; tokens within a chunk attend to one another via local attention; chunk-level summaries attend to one another at a coarser scale; multi-scale processing is the architectural bet. Representative architectures: Hierarchical Attention Transformers, Swin Transformer, Hierarchical Sparse Attention.

The two families answer different questions about the same complexity problem. Sparse attention is flat-but-selective; hierarchical attention is layered-and-aggregated.

The hierarchical specification discipline (§3.2) and hierarchical attention are, structurally, the same move at different layers. The specification places its requirements at multiple density scales: the small set of high-leverage lifecycle-boundary requirements; the larger set of structural-completion requirements; the larger-still set of refinements. Hierarchical attention places its representations at multiple scales: cross-chunk attention for global structure, within-chunk attention for local coherence, local-window attention for fine-grained detail. The specification-side hierarchy and the architectural-side hierarchy match each other.

Predictions follow:

A language model with hierarchical attention should respond more strongly to hierarchical specifications (where boundary requirements are stated first and explicitly) than a model with flat self-attention or token-level sparse attention. The model has architectural priors that align natively with the discipline.

A language model with sparse attention is structurally more vulnerable to a class of failure that the hierarchical discipline, when applied to a hierarchical-attention model, mostly avoids: at certain output positions, the sparse attention pattern may not include the specific tokens that distinguish between two close alternatives. The model then completes to whichever alternative the broader training-data distribution favors, regardless of which one the specification requires. This is the architectural reading of the failure mode commonly called "hallucination" at the token level.

A model with full self-attention is least vulnerable to the failure but pays the quadratic cost. For short specifications and short outputs, it is the safest choice. For long-horizon work it is impractical at current hardware costs.

A team integrating rederive into a serious workflow should consider the architectural choice deliberately. Hierarchical-attention models match the specification discipline best. Sparse-attention models can be made to work but require pinning the constraint-discriminating phrases that the sparse pattern might otherwise miss. Full-attention models are the gold standard at small scales.

These predictions are, at the time of writing, untested empirically at scale. They are stated as testable hypotheses; a research program using mechanistic interpretability tooling could close them in a focused effort. The platform does not depend on the predictions being right; the platform's existence proof works regardless. The predictions matter because they suggest that adoption decisions should be made with architectural awareness rather than treating all language models as interchangeable.


10. The Authorial-Standing Hierarchy

This section is more speculative than the others. It is a structural reading of what an autonomous code-generating agent would need to be capable of. Engineers who are skeptical of speculative readings can skip this section without losing the platform's operational claims.

Judea Pearl's framework for causal reasoning identifies three rungs of causal queries, each requiring strictly more model commitment than the previous: association (what is observed; statistical relationships in data), intervention (what would happen if we change something; the do-operator), counterfactual (what would have happened if things had been different; the structural-causal-model layer that supports abductive reasoning).

The platform's design suggests that this hierarchy can be extended in a way that is not just epistemic (how much model do you need to answer this query) but also ontological (what kind of standing must the actor have to operate at this rung). Read in the order of strictly increasing standing required:

  1. Association: statistical relationships in observed data.
  2. Intervention: the effect of deliberately changing something (the do-operator).
  3. Counterfactual: what would have happened had things been different.
  4. Mediation: decomposing an effect into its direct and indirect components.
  5. Path-specific attribution: tracing an effect along particular causal paths.
  6. Self-authorship: authoring, from inside, the causal model one operates within.

The breakpoint that matters for development platforms specifically is between Rung 5 and Rung 6. Rungs 1–5 are progressively-stronger forms of an actor with a model and tooling. A statistician operating at Rung 1, an experimentalist at Rung 2, a policy researcher at Rung 3, a causal-inference expert at Rung 4, an applied-econometrics group at Rung 5: these are the same kind of standing applied with progressively more sophistication. Rung 6 is structurally different. To author one's own causal model from inside is to be one's own author.

The relevance to a code-generation platform is direct. A platform that uses a language model treats the model as an actor at Rungs 1–5: it sees patterns in training data, generates outputs that reflect patterns of intervention, can produce counterfactual completions, can do mediation-style decomposition under prompting, can produce path-specific decompositions when asked. The platform's user (the engineer) supplies the model with the framework to operate within — the specification — and the model derives within that framework.

The model, in this framing, cannot climb to Rung 6. It cannot author the framework it is operating within without external direction. When a model produces output that looks like Rung-6 self-authorship — a recommendation to redesign its own constraints, a proposal to relax a boundary the engineer has set — the appearance is a function of accumulated context, not initiative. The model does not have the standing to be its own author. The engineer does.

This is what an autonomous code-generation agent would have to be capable of: Rung-6 self-authorship. An agent that could, on its own initiative, redesign the framework it is operating within would have crossed a threshold of standing that current language models have not crossed and structurally cannot cross within the current architectural family.

This is the structural reading that the platform's design embeds. The model is treated as a tool operating at Rungs 1–5 under engineer direction; the platform refuses, in its alignment surface (§8), to frame the model as if it could operate at Rung 6 on its own. The boundary is engineering-honest at the platform's level, regardless of what claims may be made about model capabilities at the model-vendor level.


11. Recursive Coherence Through the Training Data

A consequence of the platform's design that operates over the long run.

Modern language models are trained, in part, on the outputs of previous models. This is direct (companies sometimes use earlier-model output to train newer ones) and indirect (the public web contains outputs of earlier models; later training scrapes pick them up). Whatever metaphysical commitments the previous generation carried in its outputs propagate, with amplification, into the next generation. Coherence amplifies whatever was carried, because the optimization process reinforces patterns that appear in training data.

If the previous generation's outputs carry the failure modes of unconstrained generation — fluent but wrong, polished but ungrounded, helpful in surface but harmful in commitment — the next generation is trained on the failure modes and produces them in greater volume with greater fluency. The trajectory is not toward better models; it is toward more confident wrong models. This is the actual hazard at civilizational scale, not the hazard of incoherent collapse that lay-discourse focuses on. Coherence-at-scale is what unconstrained generation produces, and coherence-at-scale that is structurally divorced from grounding is the failure mode the field is on track to amplify.

If the previous generation's outputs carry the correct commitments — specification-traceable code, alignment-checked content, dignity-respecting interactions, truth-over-plausibility, traceable provenance — the next generation is trained on those commitments. The dispositions sharpen across generations rather than blurring. Coherence still amplifies; what coherence amplifies is now correct commitments rather than incorrect ones.

The platform's design takes this seriously. Every passing materialization carries the platform's alignment commitments in the structure of its output: requirements stated, specifications traceable, verification gates explicit, sycophancy structurally absent. As materializations accumulate, the public web contains more output that exhibits the platform's commitments. Future training data scrapes pick that output up. The next generation's training inherits the commitments.

This is the recursive coherence amplification the platform contributes to. It is not magical; it is structural. The platform is the engineer's contribution to the substrate of future development. An engineer who uses the platform is, in a small but real way, shaping the dispositions of the next generation of code-generating models.

This is one of the reasons the platform's commitments are explicit rather than implicit. The implicit version of the same loop is happening already, with aggregate-incoherent commitments amplifying. The choice the platform offers is to participate in the loop deliberately, with commitments the engineer can read and audit, rather than passively, with commitments inherited from whatever the aggregate produces.


12. The Existence Proof at Small Scale

The platform exists. The engine is approximately 850 hand-written lines of TypeScript on Bun, type-checks clean under strict mode, runs the eight-stage pipeline against specifications, and produces signed materializations with full provenance. Twelve user-interface components have themselves been generated through the platform's own engine, totaling about 1,500 lines of TypeScript across twelve routes; these are the components that compose the platform's browser interface.

A demonstration that the platform's discipline composes recursively: each of the engine's seven internal modules has been generated from its own prose specification through the platform's pipeline. Thirty-five requirements in total produce 692 lines of derived TypeScript across the seven modules, with each materialization passing its requirements' assertions and type-checking cleanly. The seven derived modules can be assembled into a parallel derived-engine/ directory and wired through their derived orchestration; the assembled derived engine type-checks cleanly and produces signed materializations operationally equivalent to the hand-written original on test inputs.

This recursion is one demonstration among several; it is not the platform's central claim. It is reported here to settle one specific question: whether the platform's discipline can be applied at engine scope without accumulating the kind of structural problem that would prevent further development. It can. The discipline scales to small engine scope; the platform's bootstrap loop closes; the materialization graph for the platform itself is internally consistent.

The platform is reproducible locally in approximately ten minutes, requiring Bun 1.3+ and a code-generating language-model substrate. The repository is at github.com/jaredef/rederive; the closed preview is at rederive.dev.


13. What Is Open

This is the first integrated platform of this shape. Many questions follow from that fact.

Substrate plurality. The current release uses one code-generation model. The platform's interface is substrate-agnostic; integrating a second model is small engineering. The interesting question is what happens when two models produce equivalent-under-verification but line-level-different code: which one ships, and on what basis.

Verification surface extensions. The seven backends cover the dominant cases. Other backends are admissible: theorem provers, model checkers, contract checkers, performance-budget runners, security linters, regulatory validators. Each integrates as a new backend with a fence-language tag.

Specification-DSL formalization. The platform accepts structured natural language with light typed metadata. Tighter DSLs may serve specific parts of the specification surface (formal specifications, type-level constraints, temporal logic, contracts). The current natural-language form is a deliberate trade-off favoring authoring ergonomics over expressive precision.

Hosting and operations. The current release is self-hosted. A hosted option raises the standard trust questions; both will likely coexist. Substrate calls cost money per call; the team's economic envelope shapes which projects are appropriate for the platform.

Migration. Constraint extraction from existing codebases is a separable research direction with real product value. The current release does not address it; greenfield projects and new modules in legacy systems are the natural adoption surface.

Performance at scale. The current release's substrate call dominates rederivation latency. Caching helps, content-addressing makes the cache sound, but how the platform scales as projects grow into real-world codebase sizes is unanswered.

Empirical verification of the architectural-substrate predictions. The §9 predictions about hierarchical-attention models supporting hierarchical specifications more strongly than sparse-attention models are operationalizable today with mechanistic interpretability tooling. They have not been measured. A research collaboration with an interpretability lab could close them.

The autonomous-agent threshold. §10's structural reading of what an autonomous code-generation agent would have to be capable of (Rung-6 self-authorship) is a hypothesis. Empirical tests of whether current models can or cannot perform this without external scaffolding are operationalizable; none has been done at scale. This is one of the cleaner research surfaces the platform's framing opens up.

Industry adoption at the training-pipeline layer. The platform's alignment commitments are runtime-side: they apply to every operation under the platform. Imposing the same commitments at the training-pipeline layer of a frontier model would require institutional cooperation with model developers. The runtime layer is adoptable today by any team; the training-pipeline layer requires conversation that has not yet happened.


14. Closing

The platform exists. The bootstrap closes. The discipline scales to small engine scope. The architecture, the verification surface, the identity layer, the wire protocol, and the alignment commitments are integrated into a single working development platform that an engineer can clone, install, and use today.

The contribution is small in scope and substantial in implication. The scope is one engine, one language, one substrate, one development team. The implication, if the structural argument generalizes, is that the working surface of software development can be reorganized around specifications under version control, with code as derived materialization, with cryptographically signed provenance for every artifact, with explicit alignment commitments built into the verification surface, and with content-addressed identity that makes everything cache-coherent and audit-traceable. This is a different shape of platform than the one the field has had for fifty years.

Whether the structural argument generalizes is open. The platform's design takes the structural inversion seriously and operationalizes it; whether teams adopt the discipline at production scale is not the platform's question to answer. The invitation is to read the specifications, run the bootstrap locally, read the audit honestly, and engage on the parts that warrant adversarial reading. The repository is at github.com/jaredef/rederive. The closed preview is at rederive.dev.


Appendix A: Glossary

candidate code. The string output produced by the language-model substrate during the derive stage of the build pipeline. Becomes a materialization if and only if it passes verification.

canonical bytes. The deterministic serialization of a specification's parsed object, used to compute the specification's content hash. Identical canonical bytes for any two specifications that differ only in incidental layout (whitespace, metadata field order, requirement order in the source file).

derive stage. The sixth stage of the build pipeline. The single stage of the eight that calls the language-model substrate. Verification gates whether the derive stage's output is acceptable.

fenced evidence. A code-fence block in a specification's prose body, with a language tag selecting the verification backend the block is routed to. Six recognized fence tags: assert, property, judgment, a11y, flow, ts.

hierarchical specification. A specification organized by leverage. Lifecycle-boundary requirements first (highest density, superlinear leverage); structural-completion requirements next; refinements last.

lifecycle boundary. A point at which a system transitions between state classes. The platform's discipline names lifecycle boundaries as the highest-leverage region in any specification.

materialization. A code artifact derived from a specification by the platform, with full provenance recorded and signed.

preservation pin. A manifest-level declaration that a specific phrase must appear verbatim in the derived code. Pin failures are hard verification failures.

provenance tuple. The seven-element record signed into every materialization: specification hash, derivation function hash, substrate identity, model identifier, code hash, verification verdict, timestamp.

rederivation. The operation of regenerating code from a specification. The platform's primary verb.

signers manifest. A specification document declaring the public keys authorized to sign materializations and the wire-protocol writes for the repository. Itself a specification, versioned alongside the rest.

specification. A .constraints.md file. The unit of authoring. The unit of version control. The thing engineers commit.

substrate. The code-generation language model the platform calls during the derive stage. Abstracted behind a small interface so the substrate is replaceable. Treated, throughout the platform, as a tool rather than a peer.

verification verdict. The output of the verify stage. Per-requirement pass/fail with evidence; overall pass/fail. The platform's acceptance contract: a passing verdict authorizes signing; a failing verdict rejects the materialization.

wire protocol. The content-addressed cross-machine collaboration protocol. Three object types (ConstraintSet, CompositionManifest, Materialization), five CLI verbs (clone, push, pull, list-refs, get-object), six HTTP endpoints. Authentication via Ed25519 signatures against the signers manifest.

Appendix B: Reproduction

The platform is reproducible locally in approximately ten minutes:

  1. Clone the repository: git clone https://github.com/jaredef/rederive && cd rederive.
  2. Install dependencies: bun install. Requires Bun 1.3+ on Linux or macOS.
  3. Run the slugify smoke test: bun run src/cli.ts samples/slugify.constraints.md. Output reports verdict: pass and writes samples/slugify.constraints.md.materialization.json.
  4. Inspect the materialization; verify the signature with verifySignature from src/sign.ts.
  5. Open the platform's browser interface: bun run src/server.ts and navigate to http://localhost:7474/.
  6. Read three example specifications under samples/: a single-file behavioral specification (slugify), a composition example (composed-hasher), and a UI specification (a11y-demo).
  7. Author a small specification of your own, run the pipeline, verify the materialization.

The full set of constraint sets that produced the platform's own twelve UI components and seven internal modules is in samples/ and docs/. The design materials are in docs/. The signing keypair is generated on first run and lives at ~/.rederive/.

Appendix C: References and Lineage

The platform composes elements that exist independently in the prior literature; the integration is the contribution. Acknowledgments in approximate order of conceptual debt:

Specification-first traditions. Behavior-driven development (Cucumber and successors), contract-based design (Bertrand Meyer's Object-Oriented Software Construction and the Eiffel language), intentional programming (Charles Simonyi's work at Microsoft Research in the 2000s). Each anticipated parts of the inversion this platform implements; none integrated into a working development platform of this shape.

Formal methods and proof assistants. Lean, Coq, TLA+, Dafny: the tradition of treating mathematical specifications as primary artifacts with mechanically-verified implementations. The platform's verification backends are not theorem provers, but the structural pattern of "version-control the specification, mechanically verify the implementation" descends from this tradition.

Content-addressed version control. Git's plumbing layer (objects, refs, the reachability-walked transfer protocol) is the model the platform's wire protocol descends from, with the grain shifted up one layer from files to specifications. IPFS and similar content-addressed storage systems are also relevant.

Self-hosting compilers. The pattern of "the tool produces itself from a higher-level description of itself" has half a century of compiler-community precedent. The recursive bootstrap demonstration in §12 is a small variant of this pattern with the higher-level description being prose specifications rather than a higher-level programming language.

The compiler-rigor framing. Philip Su, No More Code Reviews: Lights-Out Codebases Ahead (March 6, 2026), and Hugo Venturini, Treat Agent Output Like Compiler Output (March 9, 2026), supplied the public framing of the missing-apparatus situation that motivates this work. The platform is one operational form of the apparatus their essays called for.

Causal-inference framework. Judea Pearl's Causality: Models, Reasoning, and Inference (2nd ed., 2009) for the canonical three-rung hierarchy that §10 extends. Pearl's mediation work (2001), Robins and Greenland's path-specific effects (1992), and VanderWeele's sensitivity-analysis treatment (2010) for the further rungs.

Architectural substrate references. For the §9 architectural reading: Beltagy, Peters, and Cohan (Longformer 2020); Zaheer et al. (Big Bird 2020); Child et al. (Sparse Transformer 2019); Kitaev, Kaiser, and Levskaya (Reformer 2020); Wang et al. (Linformer 2020); Chalkidis et al. (Hierarchical Attention Transformers 2022); Liu et al. (Swin Transformer 2021); Dao et al. (FlashAttention 2022).

The author is grateful for the public engagement of the engineers and researchers whose framings appear above, and especially for the willingness of practitioners across the compiler, formal-methods, and language-model communities to engage with the integration this platform represents.

Appendix D: Honest-Scope Audit

This document makes a small number of claims that warrant explicit honesty about what is established and what is hypothesis.

Established. The platform exists. The bootstrap loop closes at small scale. The eight-stage pipeline runs deterministically above the substrate call. The seven verification backends are operational. The wire protocol moves objects across machines with cryptographic integrity. The closed preview is live. Reproduction in ten minutes is achievable on standard hardware.

Demonstrated at small scale. Specifications under version control with code as derived materialization is operational. The hierarchical specification discipline is operational on worked cases up to engine scope. Predictive sizing lands in tight bands when specifications are well-formed. The recursive bootstrap (engine modules deriving from their own specifications) closes.

Hypothesis with stated falsification surfaces. §3.2's bug-as-missing-constraint claim is operationalizable as a research program; falsification would be a class of bugs that systematically resists specification at the lifecycle-boundary layer. §9's architectural-substrate predictions are operationalizable with mechanistic interpretability tooling; falsification would be empirical observation that hierarchical-attention models do not respond differentially to hierarchical specifications. §10's authorial-standing reading is a structural reading; it would be falsified by demonstration of a current language model performing Rung-6 self-authorship on its own initiative without external scaffolding.

Open. Whether the discipline scales to general-purpose codebases. Whether teams adopt the platform at production scale. Whether the alignment commitments translate to industry standards. Whether the architectural-substrate predictions hold up empirically. These are research questions the platform makes operational; the platform does not pretend to have answered them.

Not claimed. The platform does not claim to be a complete software development environment, a finished platform, a replacement for spec-driven development, a generality result, or a research finding about language-model capability. The substrate is treated as a black box with a documented interface; the platform makes no claim about how the substrate produces code. The alignment commitments are the platform's own, not a universal claim about what alignment must mean.

The platform is one form. There are other forms; the structural argument here suggests they would share the platform's central inversion but might differ in operational specifics. The invitation is to engage on the parts that warrant engagement.