Execution engine

The execution engine is the gateway subsystem responsible for turning a plan or workflow into resilient, auditable execution. It is where reliability guarantees live: retries, idempotency, budgets/timeouts, pause/resume, and evidence capture.

Why it exists

LLMs are good at planning, but they are a poor place to host the control plane for long-running, side-effecting work. The execution engine moves orchestration into a typed runtime so that:

Side effects can be paused behind approvals and resumed safely without repeating completed work.
Runs can be retried deterministically without duplicating actions.
“Done” is backed by postconditions + artifacts, not narrative.
Operator UIs can observe progress in real time via events.

Responsibilities

Queueing and scheduling: accept work from interactive sessions, cron jobs, hooks, and external triggers.
Run state machine: track run lifecycle (queued → running → paused|succeeded|failed|cancelled, where a paused run returns to running once it is resumed) with durable persistence.
Step execution: execute steps via the tool runtime and capability providers (nodes, MCP).
Idempotency + safe retries: enforce idempotency_key semantics for side-effecting steps and define retry policies.
Approvals and pause/resume: pause runs when an approval is required and resume using a durable resume token.
Budgets and timeouts: enforce cost/time ceilings per run and per step (including model budgets where applicable).
Concurrency limits: limit parallelism per agent, per lane, per capability provider, and globally.
Evidence and verification: capture artifacts and validate postconditions (required for state-changing steps when feasible).
Rollback metadata: store human-readable rollback hints and optional structured compensation actions (always approval-gated).
Auditability: emit events for run/step lifecycle and persist a run log suitable for troubleshooting and export.
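
The run lifecycle listed above can be sketched as a transition table. This is a minimal sketch, not the engine's actual implementation: the status names come from this page, but the exact set of allowed transitions (for example, that a queued run can be cancelled, or that a paused run can fail) and the names TRANSITIONS, canTransition, and transition are assumptions made for illustration.

```typescript
// Run statuses named in the lifecycle above.
type RunStatus = "queued" | "running" | "paused" | "succeeded" | "failed" | "cancelled";

// Assumed transition table. Terminal statuses have no outgoing transitions.
const TRANSITIONS: Record<RunStatus, readonly RunStatus[]> = {
  queued: ["running", "cancelled"],
  running: ["paused", "succeeded", "failed", "cancelled"],
  paused: ["running", "failed", "cancelled"],
  succeeded: [],
  failed: [],
  cancelled: [],
};

// True when moving from `from` to `to` is allowed by the table.
function canTransition(from: RunStatus, to: RunStatus): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new status, or throws if the move is not allowed.
function transition(from: RunStatus, to: RunStatus): RunStatus {
  if (!canTransition(from, to)) {
    throw new Error(`illegal run transition: ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the table as data makes it straightforward to reject illegal updates before they are persisted and to render the same lifecycle in an operator UI.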

Distributed execution (workers)

The execution engine can run co-located with the gateway edge (even in the same OS process) or be split into separate processes/hosts. To minimize surprises when scaling up, the same execution semantics apply in all deployments: workers claim/lease work in the StateStore and publish lifecycle events through the backplane abstraction (see Scaling and High Availability).

Cluster-safe execution typically requires:

Claim/lease: workers claim work with a time-bounded lease recorded in the StateStore so only one worker executes a given attempt at a time.
Idempotency: side-effecting steps define idempotency_key semantics so retries are safe under at-least-once execution.
Lane serialization: workers acquire a distributed lock/lease keyed by (session_key, lane) before executing steps that must be serialized.
Durable outcomes: attempt results, artifacts, and postcondition evaluations are persisted before emitting “completed” events.

Claimable work items carry explicit lease fields (for example lease_owner and lease_expires_at). Claims are atomic updates, leases are renewed periodically, and takeover occurs safely on expiry.
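
The claim, renew, and takeover behavior can be illustrated with a small in-memory sketch. Only lease_owner and lease_expires_at come from the text above; the WorkItem shape and the tryClaim and renew functions are hypothetical. In a real StateStore the check and the write would need to be a single conditional update so that two workers cannot both win a claim.

```typescript
// Hypothetical work item; only the two lease fields are named in the text.
interface WorkItem {
  id: string;
  lease_owner: string | null;
  lease_expires_at: number; // epoch milliseconds; 0 when never leased
}

// A claim succeeds when the item is unleased, the previous lease has expired
// (takeover), or the caller already holds the lease. Otherwise it is refused.
function tryClaim(item: WorkItem, worker: string, now: number, ttlMs: number): boolean {
  const free = item.lease_owner === null || item.lease_expires_at <= now;
  if (!free && item.lease_owner !== worker) return false;
  item.lease_owner = worker;
  item.lease_expires_at = now + ttlMs;
  return true;
}

// Renewal only extends a lease that the worker still holds and that has not expired.
function renew(item: WorkItem, worker: string, now: number, ttlMs: number): boolean {
  if (item.lease_owner !== worker || item.lease_expires_at <= now) return false;
  item.lease_expires_at = now + ttlMs;
  return true;
}
```

A worker that fails to renew in time loses the lease, so it should stop treating the attempt as its own once renew returns false.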

Lane serialization uses explicit lane lease rows keyed by (session_key, lane) with the same expiry/renew/takeover behavior as work leases.

Idempotency is durable dedupe with cached outcomes: when an executor observes a duplicate (scope, kind, idempotency_key), it returns the stored outcome instead of repeating the side effect.
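
A minimal sketch of dedupe with cached outcomes follows. It uses an in-memory map where the engine would use durable storage, and the Outcome shape, the IdempotencyStore class, and its execute method are hypothetical names chosen for illustration.

```typescript
// Illustrative stored outcome for a side-effecting step.
interface Outcome {
  status: "succeeded" | "failed";
  result: unknown;
}

class IdempotencyStore {
  private readonly outcomes = new Map<string, Outcome>();

  // The dedupe key combines scope, kind, and idempotency_key, as described above.
  private static key(scope: string, kind: string, idempotencyKey: string): string {
    return JSON.stringify([scope, kind, idempotencyKey]);
  }

  // Runs `effect` at most once per key; a duplicate receives the stored outcome.
  execute(scope: string, kind: string, idempotencyKey: string, effect: () => Outcome): Outcome {
    const k = IdempotencyStore.key(scope, kind, idempotencyKey);
    const cached = this.outcomes.get(k);
    if (cached !== undefined) return cached;
    const outcome = effect();
    this.outcomes.set(k, outcome);
    return outcome;
  }
}
```

This sketch is single-threaded. A durable version would also have to handle two executors racing on the same key, for example by relying on a uniqueness constraint in the store.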

Retry policy is per-step with conservative defaults. Automatic retries apply only when idempotency semantics are enforced for the step.
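
The rule that automatic retries depend on enforced idempotency can be sketched as a single decision function. Everything named here is an assumption: the RetryPolicy and StepInfo shapes, the default of one attempt (no automatic retries), and the use of capped exponential backoff, which this page does not specify.

```typescript
// Hypothetical per-step retry policy; field names are illustrative.
interface RetryPolicy {
  maxAttempts: number; // total attempts, including the first
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // upper bound on the backoff delay
}

// Assumed conservative default: a single attempt, so no automatic retries.
const DEFAULT_POLICY: RetryPolicy = { maxAttempts: 1, baseDelayMs: 1000, maxDelayMs: 30000 };

interface StepInfo {
  idempotencyEnforced: boolean;
  policy?: RetryPolicy;
}

// Returns the delay before the next attempt, or null when the step must not be
// retried automatically. `attemptsSoFar` counts completed attempts (1 or more).
function nextRetryDelayMs(step: StepInfo, attemptsSoFar: number): number | null {
  if (!step.idempotencyEnforced) return null;
  const policy = step.policy ?? DEFAULT_POLICY;
  if (attemptsSoFar >= policy.maxAttempts) return null;
  const delay = policy.baseDelayMs * 2 ** (attemptsSoFar - 1);
  return Math.min(delay, policy.maxDelayMs);
}
```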

Workspace-backed execution (ToolRunner)

Many Tyrum steps are filesystem- or process-oriented (for example running a CLI tool in a workspace, reading/writing files, generating evidence artifacts). To keep TYRUM_HOME durable across runs while still scaling to multi-node clusters, Tyrum treats workspace access as an explicit execution boundary:

ToolRunner is the execution context that mounts the workspace filesystem and runs side-effecting tools.
Workers coordinate work in the StateStore (claims/leases, idempotency, lane serialization) and delegate step execution to ToolRunner.

ToolRunner has deployment-parity implementations:

Single-host/desktop: ToolRunner is a local subprocess (or in-process) operating on the local persistent TYRUM_HOME.
Cluster/Kubernetes: ToolRunner is a sandboxed job/pod that mounts the workspace PersistentVolumeClaim (PVC) in ReadWriteOnce (RWO) mode and writes outcomes back to the StateStore.

This keeps execution semantics identical while ensuring that long-lived edge/scheduler replicas do not need to mount shared workspace volumes.

Non-responsibilities

The execution engine does not decide what to do from a user message (planning is in the agent/planner).
The execution engine does not implement device-specific automation (that lives behind node capabilities).
The execution engine does not store raw secrets (that lives behind the secret provider).

Core concepts

Job vs run

Job: the queued unit of work (created by a session request, cron, or hook).
Run: an execution attempt of a job. A job can have multiple runs because of retries or operator-requested replays.

Step and attempt

Step: one atomic action in a workflow (for example “HTTP request”, “click button”, “send message”).
Attempt: one execution attempt of a step (attempt count increments on retry).

Pause/resume

When a run reaches a step that requires approval (or takeover), the engine:

Persists the run in a paused state.
Creates an approval request record.
Returns/emits a resume token that references the paused state.
Resumes only after the approval is resolved (approved/denied/expired).

Resume tokens are opaque identifiers (random ids) that map to paused-state rows in the StateStore. Tokens support expiry and revocation.
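
Token issue, expiry, and revocation can be sketched as follows. The PausedState shape and the ResumeTokens class are hypothetical, and an in-memory map stands in for the paused-state rows in the StateStore. The id generator is injected to keep the sketch free of runtime dependencies; a real deployment would be expected to use an unguessable random id.

```typescript
// Hypothetical paused-state row; the fields are illustrative.
interface PausedState {
  runId: string;
  approvalId: string;
  expiresAt: number; // epoch milliseconds
  revoked: boolean;
}

class ResumeTokens {
  private readonly rows = new Map<string, PausedState>();
  private readonly newId: () => string;

  // `newId` is assumed to return an opaque, unguessable identifier.
  constructor(newId: () => string) {
    this.newId = newId;
  }

  // Stores the paused state and returns the opaque token that references it.
  issue(runId: string, approvalId: string, now: number, ttlMs: number): string {
    const token = this.newId();
    this.rows.set(token, { runId, approvalId, expiresAt: now + ttlMs, revoked: false });
    return token;
  }

  // Marks a known token as revoked; unknown tokens are ignored.
  revoke(token: string): void {
    const row = this.rows.get(token);
    if (row !== undefined) row.revoked = true;
  }

  // Returns the paused state only for a known, unexpired, unrevoked token.
  resolve(token: string, now: number): PausedState | null {
    const row = this.rows.get(token);
    if (row === undefined || row.revoked || row.expiresAt <= now) return null;
    return row;
  }
}
```

Because the token carries no state of its own, revoking or expiring it only requires updating the row it points to.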

Evidence + postconditions (hard rule)

For state-changing steps, a postcondition should be defined whenever a verification check is feasible. The engine is responsible for executing and evaluating the postcondition and storing evidence artifacts.

If a step cannot be verified automatically, the engine must:

Mark the outcome as unverifiable (not “done”), and
Escalate to the operator (approval/takeover) before proceeding with further dependent side effects.

Unverifiable outcomes are represented as a pause with stored reports describing missing evidence; they are not separate terminal statuses.

Postconditions are typed assertion kinds (not arbitrary expression evaluation). The core set stays small and explicit; extensions are registered via plugins/connectors and validated by contracts.
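
One way to picture typed assertion kinds is a discriminated union with a fixed evaluator per kind. The three kinds below (http_status, file_exists, json_field_equals), the Evidence shape, and the evaluate function are invented for illustration; this page does not list the actual core set. Missing evidence yields "unverifiable" rather than a pass, in line with the rule above.

```typescript
// Illustrative assertion kinds; the real core set is not listed on this page.
type Postcondition =
  | { kind: "http_status"; expected: number }
  | { kind: "file_exists"; path: string }
  | { kind: "json_field_equals"; field: string; expected: string };

// Hypothetical evidence captured while the step ran.
interface Evidence {
  httpStatus?: number;
  files?: string[];
  json?: Record<string, string>;
}

type Verdict = "passed" | "failed" | "unverifiable";

// Each kind has one fixed check; there is no arbitrary expression evaluation.
function evaluate(pc: Postcondition, ev: Evidence): Verdict {
  switch (pc.kind) {
    case "http_status":
      if (ev.httpStatus === undefined) return "unverifiable";
      return ev.httpStatus === pc.expected ? "passed" : "failed";
    case "file_exists":
      if (ev.files === undefined) return "unverifiable";
      return ev.files.indexOf(pc.path) !== -1 ? "passed" : "failed";
    case "json_field_equals":
      if (ev.json === undefined) return "unverifiable";
      return ev.json[pc.field] === pc.expected ? "passed" : "failed";
  }
}
```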

Topology

Data model

jobs(id, created_at, trigger_type, trigger_key, agent_id, lane, status, input, ...)
runs(id, job_id, started_at, finished_at, status, attempt, budgets, ...)
run_steps(id, run_id, index, kind, args, idempotency_key, approval_id?, postcondition, ...)
run_step_attempts(id, run_step_id, attempt, started_at, finished_at, status, result, error, artifacts[])

Exact schemas belong in @tyrum/schemas and exported contracts.

Observability and cost

Structured logs include stable identifiers: request_id, event_id, job_id, run_id, step_id, attempt_id, and approval_id.
Cost attribution (model tokens, executor time) is persisted per run/step/attempt so budgets and approvals can be evaluated and UIs can aggregate accurately.
Deployments export tracing and metrics via OpenTelemetry.
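
Per-attempt cost records can be rolled up to run totals and compared against a budget. The sketch below assumes a hypothetical AttemptCost record with two cost dimensions (model tokens and executor time, as mentioned above); the field names and the totalsByRun and overBudget helpers are illustrative only.

```typescript
// Hypothetical per-attempt cost record; field names are illustrative.
interface AttemptCost {
  runId: string;
  stepId: string;
  attemptId: string;
  modelTokens: number;
  executorMs: number;
}

interface Totals {
  modelTokens: number;
  executorMs: number;
}

// Sums attempt-level costs into per-run totals.
function totalsByRun(costs: readonly AttemptCost[]): Map<string, Totals> {
  const out = new Map<string, Totals>();
  for (const c of costs) {
    const t = out.get(c.runId) ?? { modelTokens: 0, executorMs: 0 };
    t.modelTokens += c.modelTokens;
    t.executorMs += c.executorMs;
    out.set(c.runId, t);
  }
  return out;
}

// True when the totals exceed either ceiling in the budget.
function overBudget(totals: Totals, budget: Totals): boolean {
  return totals.modelTokens > budget.modelTokens || totals.executorMs > budget.executorMs;
}
```

Storing cost at the attempt level means the same records can be grouped by step or by run without double counting retried attempts.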

Client/UI expectations

Operator clients should be able to:

See run progress as a timeline (queued/running/paused/completed).
Inspect per-step evidence (artifacts) and postcondition results.
Resolve approvals and resume/cancel paused runs.
Request safe retries or rollbacks when supported.
