Paper 2: Algebraic Foundations
Algebraic foundations of staged durable execution.
On this page
- Abstract
- 1. Core Model
- 1.1 Graphs, States, and Validity
- 1.2 The Staged Machine
- 2. Algebraic Structure
- 2.1 Accumulation as a Commutative Action
- 2.2 Failure Closure
- 2.3 Classification
- 2.4 Determinism of the Staged Reduction
- 3. Structural Recovery Safety
- 3.1 Theorem Statement
- 3.2 What the Theorem Does and Does Not Say
- 3.3 Dependency Structure
- 3.4 Proof Status
- 4. The Forward Fragment and the Reset Boundary
- 4.1 Forward Fragment
- 4.2 Reset Boundary
- 5. Extension Boundary
- 5.1 Extensions That Preserve the Core
- 5.2 Extensions That Cross the Boundary
- 6. Relation to Other Models
- 6.1 CALM
- 6.2 CKA and BSP
- 6.3 Actor Confluence Work
- 7. Open Problems
- 8. Takeaway
- References
Algebraic Foundations of Staged Durable Execution
Abstract
We develop an algebraic account of durable workflow execution organized as a staged reduction: commutative fact accumulation, idempotent failure closure, and deterministic state classification. The core object is not a workflow program trace but a topology-indexed graph state together with a validity predicate for normalized, closed states. We separate three layers of obligation: algebraic facts about accumulation, closure, and classification; structural recovery theorems for fixed-topology execution; and external assumptions about worker I/O reproducibility. The result is a sharper statement of what the runtime proves, what it merely assumes, and what is deferred to mechanization. We also characterize the extension boundary: which changes preserve the staged reduction and which require a different machine. Topology-changing graph rewriting is treated as one such boundary and is excluded from the core theory of this paper.
1. Core Model
1.1 Graphs, States, and Validity
Let be a finite DAG. A graph state consists of:
- a total status map
- a partial output map
For the fixed-topology core:
We distinguish three kinds of states:
- mid-accumulation states, which may still contain
- normalized states, after
resetRunningToPending - closed states, after
propagateFailure
The main validity predicate is defined only on normalized, closed states.
Definition 1 (wellFormedGraphState). holds iff:
- outputs appear exactly on statuses that semantically carry outputs
- no node is or
- every or descendant of a node is itself
- is extensionally equal to the readiness predicate: nodes that are and whose predecessors are all , , or
This definition is intentionally relative to fixed topology. Once topology changes, the validity contract must be enriched with materialization and identity invariants.
1.2 The Staged Machine
The fixed-topology runtime factors as:
Where:
accumulate = foldl (flip applyNodeFact)close = propagateFailureclassify = classifyGraphState
The execution model uses barriered frontier waves, but the algebraic core only sees a set of node-local facts and a topology-indexed state.
2. Algebraic Structure
2.1 Accumulation as a Commutative Action
Let F be the free commutative monoid of NodeResult values with distinct NodeIds. Distinctness
is justified by the frontier antichain property.
applyNodeFact induces a monoid action:
The load-bearing property is not a deep join law on shared state. It is the more modest disjoint-key commutativity of point updates.
Lemma 1 (disjoint-key commutativity). If , then:
This is the precise theorem. References to LVars, CRDTs, and related work should be read as analogies in design style, not as direct reuse of their stronger shared-key commutativity results.
2.2 Failure Closure
propagateFailure is treated as a closure operator on the preorder “has at least as many failed
consequences as”:
- extensive
- monotone
- idempotent
These laws are what make partial persistence structurally safe. In this draft they are stated as semantic obligations of the implementation and deferred to the Lean mechanization plan.
2.3 Classification
classifyGraphState is a deterministic decision procedure on normalized, closed states:
Progressingif some node is readySettledif all nodes are terminalSuspendedif some node is waiting and no node is readyStuckotherwise
The case split is by construction once readiness, terminality, and waiting are fixed.
2.4 Determinism of the Staged Reduction
Theorem 2 (staged reduction determinism). For any permutation of a frontier result set:
The theorem is conditional on the identity of NodeResult values. It is a theorem about reduction
order, not about worker I/O behavior.
3. Structural Recovery Safety
3.1 Theorem Statement
Let be a frontier result set, where is the subset whose local facts were durably persisted before a crash and is the subset that must be replayed. Define:
Theorem 3 (structural recovery safety). satisfies .
Moreover:
- is a correct resume frontier
- re-closing after replayed facts are accumulated preserves already derived failures
3.2 What the Theorem Does and Does Not Say
The theorem guarantees:
- topology-indexed domain consistency
- removal of transient
Running - closure completeness
- frontier correctness
- schedulable recovery from any persisted accumulation prefix
The theorem does not guarantee:
- identical business outputs after replay
- identical worker exceptions or timing
- end-to-end correctness of the whole workflow system
Those stronger properties require assumptions or proof obligations outside the staged reduction itself.
3.3 Dependency Structure
The theorem depends on four obligations:
applyNodeFactpreserves topology-indexed domainsresetRunningToPendingpreserves domains and removes allRunningpropagateFailureis a closure operator and preserves domain consistencyreadyNodesis sound and complete for the readiness predicate
Only after these are stated explicitly does the recovery theorem stop being circular.
3.4 Proof Status
| Claim | Status in this draft | Intended home |
|---|---|---|
| Frontier antichain | sketchable from DAG readiness | Paper 1 + Paper 2 |
| Disjoint-key commutativity | direct proof | Paper 1 + Paper 2 |
| Closure extensiveness / monotonicity / idempotence | stated as obligations, not fully proved here | Lean mechanization plan |
| Structural recovery safety | stated sharply, proof outline only | Lean mechanization plan + Paper 2 |
| Outcome reproducibility under replay | external assumption | outside the algebra |
The point of the table is discipline: this paper should never silently upgrade “tested in implementation” into “proved in theory.”
4. The Forward Fragment and the Reset Boundary
4.1 Forward Fragment
The forward statuses form the local monotone fragment:
Forward outcomes move upward in this partial order. This fragment supports the cleanest algebraic reading.
4.2 Reset Boundary
Cancellation, shutdown, and crash normalization are explicitly non-monotone. They do not invalidate the staged reduction, but they do invalidate any claim that the runtime is globally monotone.
This separation should remain explicit:
- the forward fragment supports the order-theoretic story
- reset behavior is handled constitutionally by keeping resets node-local and coordinator-controlled
5. Extension Boundary
The staged reduction has three load-bearing invariants:
- I. accumulation commutativity for frontier facts
- II. closure idempotence
- III. frontier antichain / readiness soundness
5.1 Extensions That Preserve the Core
| Extension | I | II | III | Reason |
|---|---|---|---|---|
| new node-local forward outcome | yes | yes | yes | still a point update |
| richer classification output | yes | yes | yes | classification is observational |
| additional closure commuting with failure closure | yes | conditionally | yes | composition of commuting closures |
| edge annotations preserving DAG readiness semantics | yes | yes | conditionally | if readiness stays predecessor-based |
5.2 Extensions That Cross the Boundary
| Extension | Broken invariant | Why |
|---|---|---|
| inter-node mutation in accumulation | I | facts no longer commute by disjoint key |
| compensating or oscillating propagation | II | closure may stop being idempotent |
| topology mutation during accumulation | III | frontier and closure are computed over moving topology |
| shared-key accumulation without a merge algebra | I | update order becomes observable |
Topology-changing execution is therefore not “just another extension.” It requires a different machine and a different safety theorem.
6. Relation to Other Models
6.1 CALM
The CALM theorem remains a useful interpretation:
- accumulation is coordination-free
- classification is a threshold observation
- frontier barriers are the coordination boundary
This is an architectural reading, not a formal reduction of the runtime to CALM.
6.2 CKA and BSP
The runtime resembles a BSP-style fragment of concurrent Kleene algebra:
This is useful for intuition about barriered parallel composition, but we do not claim a full CKA embedding here.
6.3 Actor Confluence Work
Henrio et al.’s layered confluence results for actors are relevant because our system is also a worker-coordinator architecture. The key difference is that our workers are constitutionally restricted to node-local facts, which gives a narrower and stronger commutativity condition than arbitrary actor messaging.
7. Open Problems
- Mechanize the closure laws. This is the main proof debt and the central purpose of the Lean mechanization plan.
- Characterize the closed-state subspace. The closure operator suggests a useful sub-poset of normalized states, but its exact algebraic structure is still only partly understood.
- Weaken replay assumptions. Structural safety needs less than outcome identity. The weakest useful replay-compatibility condition is still open.
- Separate theory for topology-changing execution. Rewrite-capable execution requires a materialization boundary, durable identity scheme, and different recovery theorem. That is the subject of the rewrite materialization and recovery plan.
8. Takeaway
The point of the algebraic foundations paper is not to inflate an implementation sketch into a fake theorem. It is to isolate the exact mathematical kernel of the fixed-topology runtime:
- node-local facts commute by disjoint key
- failure closure is the load-bearing closure operator
- classification is a deterministic read on normalized, closed states
- structural recovery safety follows only after validity is defined explicitly
Everything stronger must either be proved elsewhere, or named honestly as an assumption.
References
- Kuper, L., Newton, R. (2013). LVars: Lattice-based Data Structures for Deterministic Parallelism. FHPC 2013.
- Shapiro, M. et al. (2011). Conflict-free Replicated Data Types. SSS 2011.
- Conway, N. et al. (2012). Logic and Lattices for Distributed Programming. SoCC 2012.
- Hoare, C. A. R., Möller, B., Struth, G., Wehrman, I. (2009). Foundations of Concurrent Kleene Algebra. RelMiCS 2009.
- Valiant, L. G. (1990). A Bridging Model for Parallel Computation. Communications of the ACM 33(8):103–111.
- Henrio, L. et al. (2026). Layers of Confluence for Actors. CPP 2026.