Queue-driven sandbox worker for code execution, test-case evaluation, and static analysis.
## Overview
The execution-plane is a standalone NestJS application context with no HTTP server. It pulls jobs off BullMQ queues, runs user code in an isolated sandbox, and pushes results back to a result queue. The control-plane caches those results in Redis (24-hour TTL) and serves them to the frontend via polling endpoints.

Each worker instance processes up to 5 concurrent executions. Sandbox backends are swappable through the ISandboxProvider port, and the whole pipeline is wired into OpenTelemetry for traces, metrics, and logs.

## Architecture
A circuit breaker on the control-plane side guards against Redis connectivity issues.

## Queue Contracts
### Run Code
| Queue | Direction | Purpose |
|---|---|---|
| execution.run-code | Control-plane --> Execution-plane | Job requests |
| execution.run-code.results | Execution-plane --> Control-plane | Job results |
The job and result payloads follow the existing RunCodeRequest and RunCodeResult contracts (see Job Schemas).

### Submit Code
The control-plane owns the fan-out. For N test cases, it enqueues N RunCodeRequest jobs on the same execution.run-code queue, each with the test case's stdin injected. The execution-plane has no concept of submissions or test cases; it just runs code. Aggregation and verdict computation happen in the control-plane.

Queue names: same as Run Code (execution.run-code / execution.run-code.results). No new queues needed.

### Static Analysis
Same sandbox infrastructure, but runs linters instead of user code. Covers lint issues, cyclomatic/cognitive complexity, and duplication detection.

| Queue | Direction | Purpose |
|---|---|---|
| execution.analyze | Control-plane --> Execution-plane | Analysis job requests |
| execution.analyze.results | Execution-plane --> Control-plane | Analysis results |
AnalyzeCodeResult is the internal queue payload. The HTTP response at GET /rooms/:roomId/analyze/:jobId follows the same polymorphic pattern as execution results: { status: 'queued' | 'running' } while pending, full result fields when done. The control-plane reads the queue payload's status to decide which shape to return.
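The pending-vs-done branching can be sketched as below. Note the payload field names beyond `status` (`issues`, etc.) are assumptions for illustration, not the actual AnalyzeCodeResult contract:

```typescript
// Hypothetical shape of the cached analysis payload; only `status` is
// confirmed by this document, the rest is assumed.
type AnalysisQueuePayload =
  | { status: 'queued' | 'running' }
  | { status: 'completed' | 'failed'; issues: unknown[] };

// While the job is pending, expose only the status; once done, return
// the full result object as-is (the polymorphic response pattern).
function toHttpResponse(payload: AnalysisQueuePayload) {
  if (payload.status === 'queued' || payload.status === 'running') {
    return { status: payload.status };
  }
  return payload;
}
```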
Linter selection by language:

| Language | Linter | Complexity Tool |
|---|---|---|
| Python | ruff | radon |
| JavaScript / TypeScript | biome (lint mode) | escomplex |
| Java | checkstyle | PMD |
| C / C++ | cppcheck | lizard |
| Go | golangci-lint | gocyclo |
| Rust | clippy | (built-in) |
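The table above maps to a simple lookup. This record is illustrative only; the real mapping presumably lives in the execution-plane's analysis module, and the identifier keys follow the language identifiers from the Supported Languages section:

```typescript
// Illustrative linter/complexity-tool selection table mirroring the
// matrix above. Keys use the language identifiers from this document.
const LINTERS: Record<string, { linter: string; complexityTool: string }> = {
  python: { linter: 'ruff', complexityTool: 'radon' },
  javascript: { linter: 'biome', complexityTool: 'escomplex' },
  typescript: { linter: 'biome', complexityTool: 'escomplex' },
  java: { linter: 'checkstyle', complexityTool: 'PMD' },
  c: { linter: 'cppcheck', complexityTool: 'lizard' },
  cpp: { linter: 'cppcheck', complexityTool: 'lizard' },
  go: { linter: 'golangci-lint', complexityTool: 'gocyclo' },
  rust: { linter: 'clippy', complexityTool: 'built-in' },
};
```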
## Job Schemas
All job and result types across queues, in one place.

### Run Code and Execution Client
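The actual contracts live in the shared package. As a rough, hedged sketch assembled only from fields mentioned elsewhere in this document (stdin, timeoutMs, durationMs, cpuTimeMs, timedOut, outputTruncated) — field names and optionality are assumptions:

```typescript
// Speculative sketch of the Run Code contracts; the shared-package
// definitions are authoritative.
interface RunCodeRequest {
  jobId: string;
  language: string;   // one of SUPPORTED_LANGUAGES
  code: string;
  stdin?: string;     // injected per test case by the submit fan-out
  timeoutMs?: number; // clamped at the control-plane HTTP boundary
  memoryMb?: number;
}

interface RunCodeResult {
  jobId: string;
  status: 'completed' | 'failed';
  stdout: string;
  stderr: string;
  durationMs: number;
  cpuTimeMs?: number;
  timedOut: boolean;        // heuristic: durationMs >= timeoutMs
  outputTruncated: boolean; // set when sandbox output limits are hit
}

// Example of a completed result under these assumed shapes:
const exampleResult: RunCodeResult = {
  jobId: 'job-1',
  status: 'completed',
  stdout: '42\n',
  stderr: '',
  durationMs: 120,
  timedOut: false,
  outputTruncated: false,
};
```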
### Static Analysis and Submission Aggregation
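Verdict aggregation happens in the control-plane (see Submit Code above). A minimal sketch, assuming hypothetical verdict names and a per-test-case outcome shape — neither is from the actual contract:

```typescript
// Hypothetical per-test-case outcome, derived from aggregated
// RunCodeResult + output comparison in the control-plane.
type TestCaseOutcome = { passed: boolean; timedOut: boolean; errored: boolean };

// Illustrative precedence: timeouts dominate, then runtime errors,
// then pass/fail comparison across all test cases.
function aggregateVerdict(outcomes: TestCaseOutcome[]): string {
  if (outcomes.some((o) => o.timedOut)) return 'time_limit_exceeded';
  if (outcomes.some((o) => o.errored)) return 'runtime_error';
  return outcomes.every((o) => o.passed) ? 'accepted' : 'wrong_answer';
}
```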
## Sandbox Providers
Code execution is delegated to whichever ISandboxProvider implementation is wired up in the infrastructure module. Internal execution types (processor-to-sandbox) are kept separate from the contract types.

### Available Implementations
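All of the adapters below implement the same port. A rough sketch of what ISandboxProvider might look like — the method names and argument shapes are assumptions, not the real interface:

```typescript
// Assumed internal processor-to-sandbox types.
interface ExecuteArgs {
  language: string;
  code: string;
  stdin?: string;
  timeoutMs: number;
  memoryMb: number;
}

interface SandboxExecution {
  stdout: string;
  stderr: string;
  exitCode: number | null;
  durationMs: number;
}

// Assumed shape of the port; the E2B/Docker/Kata adapters would each
// implement this and be bound via DI in the infrastructure module.
interface ISandboxProvider {
  supportsLanguage(language: string): boolean;
  execute(args: ExecuteArgs): Promise<SandboxExecution>;
}

// Trivial fake adapter showing the port in use (echoes stdin back).
class EchoSandboxAdapter implements ISandboxProvider {
  supportsLanguage(language: string): boolean {
    return language === 'python';
  }
  async execute(args: ExecuteArgs): Promise<SandboxExecution> {
    return { stdout: args.stdin ?? '', stderr: '', exitCode: 0, durationMs: 0 };
  }
}
```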
| Provider | Description |
|---|---|
| E2bSandboxAdapter | E2B Code Interpreter cloud sandboxes. One sandbox per execution, killed on completion. |
| DockerSandboxAdapter | Local Docker containers with per-language images. Needed for languages E2B does not support (C, Go, Rust). |
| KataSandboxAdapter | Kata Containers for stronger isolation in production. |

The active provider is selected in the infrastructure module via the ISandboxProvider DI binding. Once KataSandboxAdapter is implemented, it could potentially replace the other providers entirely; this is an open question.

### Supported Languages
| Language | Identifier | E2B Support | Docker (planned) | Kata (planned) |
|---|---|---|---|---|
| Python | python | Yes | Planned | Planned |
| JavaScript | javascript | Yes | Planned | Planned |
| TypeScript | typescript | Yes | Planned | Planned |
| Java | java | Yes | Planned | Planned |
| C++ | cpp | Yes | Planned | Planned |
| C | c | No | Planned | Planned |
| Go | go | No | Planned | Planned |
| Rust | rust | No | Planned | Planned |
The canonical list is defined as SUPPORTED_LANGUAGES in the shared package. E2B currently covers Python, JavaScript, TypeScript, Java, and C++. Jobs for unsupported languages fail permanently (no retry).

## Resource Limits and Timeouts
### Per-Execution Limits
| Resource | Default | Maximum | Enforcement |
|---|---|---|---|
| Wall-clock timeout | 30 seconds | 5 minutes (300,000 ms) | Sandbox-level kill |
| Memory | 128 MB | 1,024 MB | Sandbox-level OOM kill |
| Output size | Unlimited | Sandbox-dependent | outputTruncated flag set when exceeded |
| CPU time | No limit | Bounded by wall-clock timeout | Reported via cpuTimeMs |
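The control-plane clamps user-supplied values to its own tighter caps at the HTTP boundary (default 5 s / max 30 s timeout; default 256 MB / max 512 MB memory, as described in this section). A minimal clamping sketch — function and field names are assumptions:

```typescript
// Control-plane caps from this section; the execution-plane accepts a
// wider range (300 s / 1,024 MB) as a safety net behind these.
const DEFAULT_TIMEOUT_MS = 5_000;
const MAX_TIMEOUT_MS = 30_000;
const DEFAULT_MEMORY_MB = 256;
const MAX_MEMORY_MB = 512;

// Apply defaults for missing values and clamp the rest to the caps
// before the job is enqueued.
function clampLimits(req: { timeoutMs?: number; memoryMb?: number }) {
  return {
    timeoutMs: Math.min(req.timeoutMs ?? DEFAULT_TIMEOUT_MS, MAX_TIMEOUT_MS),
    memoryMb: Math.min(req.memoryMb ?? DEFAULT_MEMORY_MB, MAX_MEMORY_MB),
  };
}
```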
Two layers of limits are in play. The control-plane caps requests at the HTTP boundary (default timeout 5 s, max 30 s; default memory 256 MB, max 512 MB). The execution-plane accepts wider ranges (max timeout 300 s, max memory 1,024 MB) as a safety net. The control-plane limits are what users actually hit.

### Queue-Level Configuration
| Setting | Value | Purpose |
|---|---|---|
| Concurrency | 5 | Max simultaneous executions per worker instance |
| Lock duration | 330,000 ms (max timeout 300,000 ms + 30,000 ms safety margin) | Must exceed max timeout to prevent stalled-job false positives |
| Stalled interval | 5 sec (default) | How often BullMQ checks for stalled jobs |
| Shutdown timeout | 30 sec (default) | Graceful shutdown wait for in-flight jobs |
### Rate Limits (enforced by control-plane)
| Scope | Limit | Window |
|---|---|---|
| Code execution (run) | 10 requests | 1 minute per user |
| Code submission (submit) | 3 requests | 1 minute per user |
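As an illustration of the per-user windows above, here is a minimal in-memory fixed-window limiter. The real control-plane presumably enforces this in Redis so the limits hold across instances; this sketch is for intuition only:

```typescript
// Fixed-window counter: each key gets `limit` calls per `windowMs`.
// e.g. new FixedWindowLimiter(10, 60_000) for run,
//      new FixedWindowLimiter(3, 60_000) for submit (per this section).
class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now = Date.now()): boolean {
    const entry = this.counts.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      // New window: reset the counter.
      this.counts.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false; // caller returns 429 + Retry-After
    entry.count += 1;
    return true;
  }
}
```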
Violations return 429 Too Many Requests with a Retry-After header.

## Result Caching
Results are cached in Redis by the control-plane so the frontend can poll for them.

| Parameter | Value |
|---|---|
| Cache key format | exec-result:{jobId} |
| TTL | 24 hours |
| Storage | Redis (via ICacheService / RedisCacheAdapter) |
| Write | Control-plane result listener, on receiving a result from the result queue |
| Read | GET /execution/:jobId endpoint |
Lookup flow:

1. GET /execution/:jobId checks the cache first.
2. Cache miss: falls back to IExecutionClient.getJobStatus() to return the queue status (queued | running).
3. Job not found at all: 404 Not Found (EXECUTION_JOB_NOT_FOUND).

Response shapes:

- Pending: { status: 'queued' } or { status: 'running' }
- Done: full RunCodeResult with status: 'completed' | 'failed'
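The lookup flow above can be sketched with hypothetical port shapes — CacheLike and ExecutionClientLike here are stand-ins, not the real ICacheService / IExecutionClient interfaces:

```typescript
// Minimal stand-ins for the control-plane ports.
interface CacheLike {
  get(key: string): Promise<string | null>;
}
interface ExecutionClientLike {
  getJobStatus(jobId: string): Promise<'queued' | 'running' | null>;
}

async function getExecutionResult(
  jobId: string,
  cache: CacheLike,
  client: ExecutionClientLike,
) {
  // 1. Cache first (key format from the table above).
  const cached = await cache.get(`exec-result:${jobId}`);
  if (cached) return JSON.parse(cached);
  // 2. Cache miss: fall back to the queue status.
  const status = await client.getJobStatus(jobId);
  if (status) return { status };
  // 3. Unknown job: maps to 404 Not Found at the HTTP layer.
  throw new Error('EXECUTION_JOB_NOT_FOUND');
}
```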
## Error Handling
### Validation Errors (permanent failures)
The processor validates each job before touching the sandbox. Validation failures are permanent (no retry), and a failed result is published immediately.

| Condition | Error Message |
|---|---|
| Unsupported language (not in SUPPORTED_LANGUAGES) | Unsupported language: {language} |
| Sandbox does not support language | Sandbox does not support language: {language} |
| Empty code | Code cannot be empty |
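The checks in the table above, as a sketch. The SUPPORTED_LANGUAGES list here is a stand-in for the shared-package constant, and the function shape is assumed:

```typescript
// Stand-in for the canonical shared-package constant.
const SUPPORTED_LANGUAGES = [
  'python', 'javascript', 'typescript', 'java', 'cpp', 'c', 'go', 'rust',
];

// Returns the error message for a permanently failed job, or null if
// the job may proceed to the sandbox. Messages mirror the table above.
function validateJob(
  job: { language: string; code: string },
  sandboxSupports: (lang: string) => boolean,
): string | null {
  if (!SUPPORTED_LANGUAGES.includes(job.language)) {
    return `Unsupported language: ${job.language}`;
  }
  if (!sandboxSupports(job.language)) {
    return `Sandbox does not support language: ${job.language}`;
  }
  if (job.code.trim().length === 0) {
    return 'Code cannot be empty';
  }
  return null;
}
```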
### Sandbox Errors
Most errors are caught inside the E2B adapter and returned as status: 'failed'. These are permanent failures with no retry; this covers bad user code, runtime exceptions, and runCode failures.

The one exception: sandbox creation failures (Sandbox.create()). These propagate to BullMQ and trigger retries (3 attempts, exponential backoff, 1 s base delay). The retry policy is set by the control-plane at enqueue time, not by the processor. These failures are transient (E2B API outage, network timeout, etc.), which is why they are retried rather than failed permanently.

### Timeout Handling
When execution exceeds timeoutMs:

1. The sandbox kills the process.
2. The result comes back as status: 'failed'.
3. timedOut is set by heuristic: durationMs >= timeoutMs. This works when the sandbox throws on timeout, but can be false if the process is killed silently.
4. durationMs reflects actual elapsed time.
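Step 3's heuristic is a single comparison. A literal transcription (the function name is invented):

```typescript
// timedOut heuristic from step 3: elapsed time reaching the limit is
// treated as a timeout. Can under-report if the sandbox kills the
// process early and silently.
function inferTimedOut(durationMs: number, timeoutMs: number): boolean {
  return durationMs >= timeoutMs;
}
```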
With E2B, timedOut: true is best-effort because it relies on runCode throwing and the elapsed time reaching timeoutMs. If the sandbox kills the process without throwing (it just returns empty output), timedOut stays false. The Docker and Kata providers should have more reliable timeout signaling.

### Circuit Breaker (control-plane side)
IExecutionClient is wrapped with a circuit breaker proxy. When the circuit opens:

- POST /rooms/:id/run: 503 (the error propagates directly from runCode).
- POST /rooms/:id/submit: per-test-case errors in the response array (Promise.all with a .catch() per job, never a 503).
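For intuition, a generic circuit-breaker sketch — this is not the actual proxy implementation, and the threshold and cooldown values are invented:

```typescript
// After `threshold` consecutive failures the circuit opens and calls
// fail fast; once `cooldownMs` elapses, one trial call is allowed through.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>, now = Date.now()): Promise<T> {
    if (this.openedAt !== null && now - this.openedAt < this.cooldownMs) {
      // Fast failure while open — surfaces as a 503 on /run.
      throw new Error('circuit open');
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = now;
      throw err;
    }
  }
}
```

On /submit, the control-plane attaches a per-job .catch() so an open circuit produces per-test-case errors in the response array instead of a 503.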
## Observability
### Telemetry Stack
| Signal | Exporter | Destination |
|---|---|---|
| Traces | OTLP/HTTP | Tempo (via OTEL_EXPORTER_OTLP_ENDPOINT) |
| Metrics | OTLP/HTTP (60s interval) | Prometheus (via OTLP receiver) |
| Logs | pino-opentelemetry-transport | Loki (via OTLP receiver) |
### Key Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| Queue depth (waiting + active) | Gauge | queue | Capacity monitoring |
| Job processing duration | Histogram | queue, language, status | Latency tracking |
| Job failure rate | Counter | queue, language, error_type | Reliability monitoring |
| Language distribution | Counter | language | Usage analytics |
| Sandbox creation latency | Histogram | provider | Provider health |
| Active sandboxes | Gauge | provider | Concurrency monitoring |
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| REDIS_URL | No | redis://localhost:6379 | Redis connection for BullMQ |
| E2B_API_KEY | Yes | -- | E2B Code Interpreter API key |
| OTEL_EXPORTER_OTLP_ENDPOINT | No | -- | OTLP endpoint (telemetry disabled if unset) |
| NODE_ENV | No | development | Environment mode |
Modified at 2026-03-12 05:26:10