Queue-driven sandbox worker for code execution, test-case evaluation, and static analysis.
## Overview
The execution-plane is a standalone NestJS application context with no HTTP server. It pulls jobs off BullMQ queues, runs user code in an isolated sandbox, and pushes results back to a result queue. The control-plane caches those results in Redis (24-hour TTL) and serves them to the frontend via polling endpoints.

Each worker instance processes up to 5 concurrent executions. Sandbox backends are swappable through the ISandboxProvider port, and the whole pipeline is wired into OpenTelemetry for traces, metrics, and logs.

## Architecture
A circuit breaker on the control-plane side guards against Redis connectivity issues.

## Queue Contracts
### Run Code
| Queue | Direction | Purpose |
|---|---|---|
| execution.run-code | Control-plane --> Execution-plane | Job requests |
| execution.run-code.results | Execution-plane --> Control-plane | Job results |
The job and result payloads follow the existing RunCodeRequest and RunCodeResult contracts (see Job Schemas).

### Submit Code
The control-plane owns the fan-out. For N test cases, it enqueues N RunCodeRequest jobs on the same execution.run-code queue, each with the test case's stdin injected. The execution-plane has no concept of submissions or test cases; it just runs code. Aggregation and verdict computation happen in the control-plane.

Queue names: same as Run Code (execution.run-code / execution.run-code.results). No new queues needed.

### Static Analysis
Same sandbox infrastructure, but runs linters instead of user code. Covers lint issues, cyclomatic/cognitive complexity, and duplication detection.

| Queue | Direction | Purpose |
|---|---|---|
| execution.analyze | Control-plane --> Execution-plane | Analysis job requests |
| execution.analyze.results | Execution-plane --> Control-plane | Analysis results |
AnalyzeCodeResult is the internal queue payload. The HTTP response at GET /rooms/:roomId/analyze/:jobId follows the same polymorphic pattern as execution results: { status: 'queued' | 'running' } while pending, full result fields when done. The control-plane reads the queue payload's status to decide which shape to return.
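The pending-vs-done branching can be sketched as below. Note the payload field names beyond `status` (`issues`, etc.) are assumptions for illustration, not the actual AnalyzeCodeResult contract:

```typescript
// Hypothetical shape of the cached analysis payload; only `status` is
// confirmed by this document, the rest is assumed.
type AnalysisQueuePayload =
  | { status: 'queued' | 'running' }
  | { status: 'completed' | 'failed'; issues: unknown[] };

// While the job is pending, expose only the status; once done, return
// the full result object as-is (the polymorphic response pattern).
function toHttpResponse(payload: AnalysisQueuePayload) {
  if (payload.status === 'queued' || payload.status === 'running') {
    return { status: payload.status };
  }
  return payload;
}
```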
Linter selection by language:

| Language | Linter | Complexity Tool |
|---|---|---|
| Python | ruff | radon |
| JavaScript / TypeScript | biome (lint mode) | escomplex |
| Java | checkstyle | PMD |
| C / C++ | cppcheck | lizard |
| Go | golangci-lint | gocyclo |
| Rust | clippy | (built-in) |
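The table above maps to a simple lookup. This record is illustrative only; the real mapping presumably lives in the execution-plane's analysis module, and the identifier keys follow the language identifiers from the Supported Languages section:

```typescript
// Illustrative linter/complexity-tool selection table mirroring the
// matrix above. Keys use the language identifiers from this document.
const LINTERS: Record<string, { linter: string; complexityTool: string }> = {
  python: { linter: 'ruff', complexityTool: 'radon' },
  javascript: { linter: 'biome', complexityTool: 'escomplex' },
  typescript: { linter: 'biome', complexityTool: 'escomplex' },
  java: { linter: 'checkstyle', complexityTool: 'PMD' },
  c: { linter: 'cppcheck', complexityTool: 'lizard' },
  cpp: { linter: 'cppcheck', complexityTool: 'lizard' },
  go: { linter: 'golangci-lint', complexityTool: 'gocyclo' },
  rust: { linter: 'clippy', complexityTool: 'built-in' },
};
```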
## Job Schemas
All job and result types across queues, in one place.

### Run Code and Execution Client
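The actual contracts live in the shared package. As a rough, hedged sketch assembled only from fields mentioned elsewhere in this document (stdin, timeoutMs, durationMs, cpuTimeMs, timedOut, outputTruncated) — field names and optionality are assumptions:

```typescript
// Speculative sketch of the Run Code contracts; the shared-package
// definitions are authoritative.
interface RunCodeRequest {
  jobId: string;
  language: string;   // one of SUPPORTED_LANGUAGES
  code: string;
  stdin?: string;     // injected per test case by the submit fan-out
  timeoutMs?: number; // clamped at the control-plane HTTP boundary
  memoryMb?: number;
}

interface RunCodeResult {
  jobId: string;
  status: 'completed' | 'failed';
  stdout: string;
  stderr: string;
  durationMs: number;
  cpuTimeMs?: number;
  timedOut: boolean;        // heuristic: durationMs >= timeoutMs
  outputTruncated: boolean; // set when sandbox output limits are hit
}

// Example of a completed result under these assumed shapes:
const exampleResult: RunCodeResult = {
  jobId: 'job-1',
  status: 'completed',
  stdout: '42\n',
  stderr: '',
  durationMs: 120,
  timedOut: false,
  outputTruncated: false,
};
```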
### Static Analysis and Submission Aggregation
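Verdict aggregation happens in the control-plane (see Submit Code above). A minimal sketch, assuming hypothetical verdict names and a per-test-case outcome shape — neither is from the actual contract:

```typescript
// Hypothetical per-test-case outcome, derived from aggregated
// RunCodeResult + output comparison in the control-plane.
type TestCaseOutcome = { passed: boolean; timedOut: boolean; errored: boolean };

// Illustrative precedence: timeouts dominate, then runtime errors,
// then pass/fail comparison across all test cases.
function aggregateVerdict(outcomes: TestCaseOutcome[]): string {
  if (outcomes.some((o) => o.timedOut)) return 'time_limit_exceeded';
  if (outcomes.some((o) => o.errored)) return 'runtime_error';
  return outcomes.every((o) => o.passed) ? 'accepted' : 'wrong_answer';
}
```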
## Sandbox Providers
Code execution is delegated to whichever ISandboxProvider implementation is wired up in the infrastructure module. Internal execution types (processor-to-sandbox) are kept separate from the contract types.

### Available Implementations
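All of the adapters below implement the same port. A rough sketch of what ISandboxProvider might look like — the method names and argument shapes are assumptions, not the real interface:

```typescript
// Assumed internal processor-to-sandbox types.
interface ExecuteArgs {
  language: string;
  code: string;
  stdin?: string;
  timeoutMs: number;
  memoryMb: number;
}

interface SandboxExecution {
  stdout: string;
  stderr: string;
  exitCode: number | null;
  durationMs: number;
}

// Assumed shape of the port; the E2B/Docker/Kata adapters would each
// implement this and be bound via DI in the infrastructure module.
interface ISandboxProvider {
  supportsLanguage(language: string): boolean;
  execute(args: ExecuteArgs): Promise<SandboxExecution>;
}

// Trivial fake adapter showing the port in use (echoes stdin back).
class EchoSandboxAdapter implements ISandboxProvider {
  supportsLanguage(language: string): boolean {
    return language === 'python';
  }
  async execute(args: ExecuteArgs): Promise<SandboxExecution> {
    return { stdout: args.stdin ?? '', stderr: '', exitCode: 0, durationMs: 0 };
  }
}
```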
| Provider | Description |
|---|---|
| E2bSandboxAdapter | E2B Code Interpreter cloud sandboxes. One sandbox per execution, killed on completion. |
| DockerSandboxAdapter | Local Docker containers with per-language images. Needed for languages E2B does not support (C, Go, Rust). |
| KataSandboxAdapter | Kata Containers for stronger isolation in production. |

The active provider is selected in the infrastructure module via the ISandboxProvider DI binding. Once KataSandboxAdapter is implemented, it could potentially replace the other providers entirely; this is an open question.

### Supported Languages
| Language | Identifier | E2B Support | Docker (planned) | Kata (planned) |
|---|---|---|---|---|
| Python | python | Yes | Planned | Planned |
| JavaScript | javascript | Yes | Planned | Planned |
| TypeScript | typescript | Yes | Planned | Planned |
| Java | java | Yes | Planned | Planned |
| C++ | cpp | Yes | Planned | Planned |
| C | c | No | Planned | Planned |
| Go | go | No | Planned | Planned |
| Rust | rust | No | Planned | Planned |
The canonical list is defined as SUPPORTED_LANGUAGES in the shared package. E2B currently covers Python, JavaScript, TypeScript, Java, and C++. Jobs for unsupported languages fail permanently (no retry).

## Resource Limits and Timeouts
### Per-Execution Limits
| Resource | Default | Maximum | Enforcement |
|---|---|---|---|
| Wall-clock timeout | 30 seconds | 5 minutes (300,000 ms) | Sandbox-level kill |
| Memory | 128 MB | 1,024 MB | Sandbox-level OOM kill |
| Output size | Unlimited | Sandbox-dependent | outputTruncated flag set when exceeded |
| CPU time | No limit | Bounded by wall-clock timeout | Reported via cpuTimeMs |
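The control-plane clamps user-supplied values to its own tighter caps at the HTTP boundary (default 5 s / max 30 s timeout; default 256 MB / max 512 MB memory, as described in this section). A minimal clamping sketch — function and field names are assumptions:

```typescript
// Control-plane caps from this section; the execution-plane accepts a
// wider range (300 s / 1,024 MB) as a safety net behind these.
const DEFAULT_TIMEOUT_MS = 5_000;
const MAX_TIMEOUT_MS = 30_000;
const DEFAULT_MEMORY_MB = 256;
const MAX_MEMORY_MB = 512;

// Apply defaults for missing values and clamp the rest to the caps
// before the job is enqueued.
function clampLimits(req: { timeoutMs?: number; memoryMb?: number }) {
  return {
    timeoutMs: Math.min(req.timeoutMs ?? DEFAULT_TIMEOUT_MS, MAX_TIMEOUT_MS),
    memoryMb: Math.min(req.memoryMb ?? DEFAULT_MEMORY_MB, MAX_MEMORY_MB),
  };
}
```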
Two layers of limits are in play. The control-plane caps requests at the HTTP boundary (default timeout 5 s, max 30 s; default memory 256 MB, max 512 MB). The execution-plane accepts wider ranges (max timeout 300 s, max memory 1,024 MB) as a safety net. The control-plane limits are what users actually hit.

### Queue-Level Configuration
| Setting | Value | Purpose |
|---|---|---|
| Concurrency | 5 | Max simultaneous executions per worker instance |
| Lock duration | 330,000 ms (max timeout 300,000 ms + 30,000 ms safety margin) | Must exceed max timeout to prevent stalled-job false positives |
| Stalled interval | 5 sec (default) | How often BullMQ checks for stalled jobs |
| Shutdown timeout | 30 sec (default) | Graceful shutdown wait for in-flight jobs |
### Rate Limits (enforced by control-plane)
| Scope | Limit | Window |
|---|---|---|
| Code execution (run) | 10 requests | 1 minute per user |
| Code submission (submit) | 3 requests | 1 minute per user |
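As an illustration of the per-user windows above, here is a minimal in-memory fixed-window limiter. The real control-plane presumably enforces this in Redis so the limits hold across instances; this sketch is for intuition only:

```typescript
// Fixed-window counter: each key gets `limit` calls per `windowMs`.
// e.g. new FixedWindowLimiter(10, 60_000) for run,
//      new FixedWindowLimiter(3, 60_000) for submit (per this section).
class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now = Date.now()): boolean {
    const entry = this.counts.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      // New window: reset the counter.
      this.counts.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false; // caller returns 429 + Retry-After
    entry.count += 1;
    return true;
  }
}
```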
Violations return 429 Too Many Requests with a Retry-After header.

## Result Caching
Results are cached in Redis by the control-plane so the frontend can poll for them.

| Parameter | Value |
|---|---|
| Cache key format | exec-result:{jobId} |
| TTL | 24 hours |
| Storage | Redis (via ICacheService / RedisCacheAdapter) |
| Write | Control-plane result listener, on receiving a result from the result queue |
| Read | GET /execution/:jobId endpoint |
Lookup flow:

1. GET /execution/:jobId checks the cache first.
2. Cache miss: falls back to IExecutionClient.getJobStatus() to return the queue status (queued | running).
3. Job not found at all: 404 Not Found (EXECUTION_JOB_NOT_FOUND).

Response shapes:

- Pending: { status: 'queued' } or { status: 'running' }
- Done: full RunCodeResult with status: 'completed' | 'failed'
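The lookup flow above can be sketched with hypothetical port shapes — CacheLike and ExecutionClientLike here are stand-ins, not the real ICacheService / IExecutionClient interfaces:

```typescript
// Minimal stand-ins for the control-plane ports.
interface CacheLike {
  get(key: string): Promise<string | null>;
}
interface ExecutionClientLike {
  getJobStatus(jobId: string): Promise<'queued' | 'running' | null>;
}

async function getExecutionResult(
  jobId: string,
  cache: CacheLike,
  client: ExecutionClientLike,
) {
  // 1. Cache first (key format from the table above).
  const cached = await cache.get(`exec-result:${jobId}`);
  if (cached) return JSON.parse(cached);
  // 2. Cache miss: fall back to the queue status.
  const status = await client.getJobStatus(jobId);
  if (status) return { status };
  // 3. Unknown job: maps to 404 Not Found at the HTTP layer.
  throw new Error('EXECUTION_JOB_NOT_FOUND');
}
```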
## Error Handling
### Validation Errors (permanent failures)
The processor validates each job before touching the sandbox. Validation failures are permanent (no retry), and a failed result is published immediately.

| Condition | Error Message |
|---|---|
| Unsupported language (not in SUPPORTED_LANGUAGES) | Unsupported language: {language} |
| Sandbox does not support language | Sandbox does not support language: {language} |
| Empty code | Code cannot be empty |
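The checks in the table above, as a sketch. The SUPPORTED_LANGUAGES list here is a stand-in for the shared-package constant, and the function shape is assumed:

```typescript
// Stand-in for the canonical shared-package constant.
const SUPPORTED_LANGUAGES = [
  'python', 'javascript', 'typescript', 'java', 'cpp', 'c', 'go', 'rust',
];

// Returns the error message for a permanently failed job, or null if
// the job may proceed to the sandbox. Messages mirror the table above.
function validateJob(
  job: { language: string; code: string },
  sandboxSupports: (lang: string) => boolean,
): string | null {
  if (!SUPPORTED_LANGUAGES.includes(job.language)) {
    return `Unsupported language: ${job.language}`;
  }
  if (!sandboxSupports(job.language)) {
    return `Sandbox does not support language: ${job.language}`;
  }
  if (job.code.trim().length === 0) {
    return 'Code cannot be empty';
  }
  return null;
}
```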
### Sandbox Errors
Most errors are caught inside the E2B adapter and returned as status: 'failed'. These are permanent failures with no retry; this covers bad user code, runtime exceptions, and runCode failures.

The one exception: sandbox creation failures (Sandbox.create()). These propagate to BullMQ and trigger retries (3 attempts, exponential backoff, 1 s base delay). The retry policy is set by the control-plane at enqueue time, not by the processor. These failures are transient (E2B API outage, network timeout, etc.), which is why they are retried rather than failed permanently.

### Timeout Handling
When execution exceeds timeoutMs:

1. The sandbox kills the process.
2. The result comes back as status: 'failed'.
3. timedOut is set by heuristic: durationMs >= timeoutMs. This works when the sandbox throws on timeout, but can be false if the process is killed silently.
4. durationMs reflects actual elapsed time.
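Step 3's heuristic is a single comparison. A literal transcription (the function name is invented):

```typescript
// timedOut heuristic from step 3: elapsed time reaching the limit is
// treated as a timeout. Can under-report if the sandbox kills the
// process early and silently.
function inferTimedOut(durationMs: number, timeoutMs: number): boolean {
  return durationMs >= timeoutMs;
}
```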
With E2B, timedOut: true is best-effort because it relies on runCode throwing and the elapsed time reaching timeoutMs. If the sandbox kills the process without throwing (it just returns empty output), timedOut stays false. The Docker and Kata providers should have more reliable timeout signaling.

### Circuit Breaker (control-plane side)
IExecutionClient is wrapped with a circuit breaker proxy. When the circuit opens:

- POST /rooms/:id/run: 503 (the error propagates directly from runCode).
- POST /rooms/:id/submit: per-test-case errors in the response array (Promise.all with a .catch() per job, never a 503).
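For intuition, a generic circuit-breaker sketch — this is not the actual proxy implementation, and the threshold and cooldown values are invented:

```typescript
// After `threshold` consecutive failures the circuit opens and calls
// fail fast; once `cooldownMs` elapses, one trial call is allowed through.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>, now = Date.now()): Promise<T> {
    if (this.openedAt !== null && now - this.openedAt < this.cooldownMs) {
      // Fast failure while open — surfaces as a 503 on /run.
      throw new Error('circuit open');
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = now;
      throw err;
    }
  }
}
```

On /submit, the control-plane attaches a per-job .catch() so an open circuit produces per-test-case errors in the response array instead of a 503.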
## Observability
### Telemetry Stack
| Signal | Exporter | Destination |
|---|---|---|
| Traces | OTLP/HTTP | Tempo (via OTEL_EXPORTER_OTLP_ENDPOINT) |
| Metrics | OTLP/HTTP (60s interval) | Prometheus (via OTLP receiver) |
| Logs | pino-opentelemetry-transport | Loki (via OTLP receiver) |
### Key Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| Queue depth (waiting + active) | Gauge | queue | Capacity monitoring |
| Job processing duration | Histogram | queue, language, status | Latency tracking |
| Job failure rate | Counter | queue, language, error_type | Reliability monitoring |
| Language distribution | Counter | language | Usage analytics |
| Sandbox creation latency | Histogram | provider | Provider health |
| Active sandboxes | Gauge | provider | Concurrency monitoring |
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| REDIS_URL | No | redis://localhost:6379 | Redis connection for BullMQ |
| E2B_API_KEY | Yes | -- | E2B Code Interpreter API key |
| OTEL_EXPORTER_OTLP_ENDPOINT | No | -- | OTLP endpoint (telemetry disabled if unset) |
| NODE_ENV | No | development | Environment mode |
Modified at 2026-03-12 05:26:10