Software Factory - AI Agent Runtime and PR Sandbox

# Software Factory - AI Agent Runtime and PR Sandbox

Backlink: [[2026-05-20]]

## One-line thesis

The world probably does not need another generic AI coding agent. It needs a **factory layer around coding agents** that gives every AI-written change a real development environment, browser feedback loop, preview URL, test evidence, video proof, code review, and a clean GitHub PR.

In other words:

> AI coding agents create demand for a new kind of platform engineering: environments and feedback loops built for machines, not just humans.
>
> "Given a fresh machine and this repo, how do I make the app runnable and testable?"

## The product vision

A **software factory** is a system where each project has a Telegram chat. A user can send an idea or feature request into that chat, and the system automatically:

1. Understands the request.
2. Creates an isolated branch/worktree or sandbox.
3. Starts the project-specific dev environment.
4. Lets an AI coding agent implement the change.
5. Gives the agent a browser/app runtime so it can test and debug what it wrote.
6. Runs automated tests.
7. Records a video or screenshots proving the feature was exercised.
8. Opens a GitHub PR.
9. Adds a preview link, video artifact, test summary, and implementation summary to the PR.
10. Triggers a second review bot to review the code.
11. Posts the final PR, preview link, test evidence, and reviewer verdict back into Telegram.

The desired human experience:

> I type the idea into the project chat. The factory returns a reviewable PR with proof that the app actually ran.

## Why this matters now

AI can increasingly produce plausible code. But plausible code is not the same as working software.

The real bottleneck is shifting from:

> Can AI write code?

To:

> Can AI run the app, observe failure, fix its own mistakes, produce proof, and hand humans a trustworthy PR?

Developers are already using AI heavily, but they do not fully trust it.
Stack Overflow's 2025 Developer Survey reported that:

- 84% of respondents use or plan to use AI tools in development.
- 51% of professional developers use AI tools daily.
- Only about 33% trust AI accuracy, while about 46% distrust it.
- The biggest frustration, cited by 66%, is AI solutions that are "almost right, but not quite."
- 87% of agent users are concerned about AI-agent accuracy.
- 81% are concerned about security and privacy.

This indicates a real trust gap.

The product should not be framed as "AI writes code." That part is becoming commoditized.

It should be framed as:

> AI PRs you can trust: every request becomes a PR with a live preview, test evidence, video validation, and automated review.

## The deeper pain point

The obvious pain point is that humans want to see the output of AI-written code.

The deeper pain point is that the AI itself needs an executable feedback loop while it is coding.

A good coding agent needs to:

1. Edit code.
2. Install dependencies.
3. Build the app.
4. Start the required services.
5. Run tests.
6. Open the browser or app.
7. Try the feature.
8. Observe errors and logs.
9. Fix the code.
10. Repeat until the feature works.

Without a real running environment, the agent is coding in the dark. This is one reason AI code often feels "almost right, but not quite." The model can generate a plausible diff without truly exercising the product.

So the product is not merely "Telegram to PR." It is:

> A runtime feedback environment for coding agents.

Or more specifically:

> Per-PR development sandboxes for AI agents, with live app execution, browser feedback, video validation, and PR packaging.

## Did this problem exist before AI?

Yes. This problem existed before AI in several forms.

### 1. Local development setup

Teams have always had the problem:

> It works on my machine.

Or:

> The new engineer took three days to get the repo running.

This involved:

- installing dependencies,
- setting up databases,
- configuring env vars,
- running migrations,
- starting background workers,
- keeping local dev close enough to production,
- testing frontend and backend together,
- connecting to internal services,
- and keeping setup docs current.

A script like `dev.sh` is the classic response: encode tribal knowledge into a repeatable command.

### 2. Staging environments

Traditional teams often used shared staging environments:

```text
dev -> staging -> production
```

But shared staging has limitations:

- Developers block each other.
- One broken branch can break staging.
- Test data becomes messy.
- Staging may not match the branch being reviewed.
- It is hard to test many PRs concurrently.

### 3. Preview environments / review apps

Pre-AI, many companies introduced one environment per PR or branch.

Examples include:

- Heroku Review Apps,
- Vercel Preview Deployments,
- Netlify Deploy Previews,
- Render preview environments,
- Railway environments,
- Fly.io apps,
- Cloudflare Pages previews,
- Kubernetes ephemeral environments.

The idea was:

> Every branch or PR should get a running app URL.

This is close to the software factory idea, but it was designed mainly for human developers and reviewers, not autonomous coding agents.

### 4. CI/CD

CI/CD systems already handled parts of the workflow:

- building,
- linting,
- testing,
- deploying,
- artifact collection,
- status checks.

Examples:

- Jenkins,
- GitHub Actions,
- GitLab CI,
- CircleCI,
- Buildkite,
- Harness,
- Travis CI,
- ArgoCD,
- Spinnaker.

But CI/CD usually runs **after** code is written and pushed. AI agents need feedback **while** writing the code. That is the key difference.

### 5. Internal Developer Platforms

Larger companies built internal platforms so developers could self-serve:

- create services,
- create environments,
- deploy branches,
- run tests,
- access logs,
- get databases,
- get secrets,
- monitor apps.

This became the world of:

- Internal Developer Platforms,
- Platform Engineering,
- Developer Experience,
- Developer Productivity Engineering.

Examples/tools in this broader space include:

- Backstage,
- Humanitec,
- Port,
- Cortex,
- OpsLevel,
- Qovery,
- Okteto,
- Northflank,
- Coder,
- Gitpod,
- GitHub Codespaces,
- Terraform Cloud,
- Kubernetes-based internal platforms.

So the problem is not new. What is new is that AI agents need these systems to be explicit, machine-readable, deterministic, and available per task.

## Who owned this problem pre-AI?

Several roles owned pieces of this problem.

### DevOps engineers

They handled:

- CI/CD pipelines,
- deployment scripts,
- Dockerfiles,
- Kubernetes manifests,
- infrastructure automation,
- release processes,
- monitoring,
- production reliability.

The software factory overlaps with DevOps, but focuses more on development-time automation and agent feedback loops, not only production deployment.

### Platform engineers

This is probably the closest existing role. Platform engineers build internal systems so developers can self-serve.

They own:

- internal developer platforms,
- service templates,
- deployment workflows,
- preview environments,
- secrets access,
- Kubernetes abstractions,
- dev portals,
- golden paths,
- environment provisioning,
- CI/CD templates.

A software factory augments platform engineers by turning the platform into something AI agents can use.

### DevEx / Developer Experience engineers

They focus on making developers faster and less frustrated:

- local dev setup,
- onboarding,
- build speed,
- test speed,
- documentation,
- CLIs,
- dev containers,
- preview environments,
- internal dashboards.

The software factory is essentially **DevEx for humans and AI agents together**.

### Developer Productivity engineers

They work on:

- build systems,
- test systems,
- monorepo tooling,
- CI performance,
- code generation,
- local development tools,
- dependency management.

The software factory would make their tooling usable by autonomous agents.

### Release engineers

They own:

- build artifacts,
- versioning,
- release branches,
- release notes,
- promotion between environments,
- rollback processes.

The software factory touches release engineering, but earlier: PR creation and validation.

### QA automation engineers

They own:

- E2E tests,
- Selenium/Cypress/Playwright,
- regression suites,
- test plans,
- screenshots/videos,
- test environments.

The software factory brings a first-pass QA loop into every AI-generated PR.

### SREs

They focus on production reliability:

- observability,
- incidents,
- SLOs,
- production automation,
- capacity,
- reliability.

The software factory may reduce production risk by catching bad changes earlier, but it does not replace SRE.

### Senior engineers / tech leads

In smaller teams, the senior engineer or tech lead often writes:

- `dev.sh`,
- `Makefile`,
- Docker Compose setup,
- onboarding docs,
- CI workflow,
- seed scripts,
- deployment scripts.

This is exactly the pattern with Diffie.

## What does this augment or replace?

The software factory mostly **augments**:

- platform engineers,
- DevOps engineers,
- DevEx teams,
- developer productivity teams,
- QA automation engineers,
- senior engineers and tech leads.

It may partially replace:

- manual PR preparation,
- manual smoke testing,
- some basic QA passes,
- custom glue scripts,
- repetitive environment setup,
- repetitive implementation work for small tickets.

It probably does **not** replace:

- strong platform engineering,
- senior architecture judgment,
- product engineering judgment,
- SRE,
- serious human code review,
- security/governance ownership.

The human role shifts from implementation operator to orchestrator, reviewer, product thinker, and platform designer.

## Why server infrastructure companies only partially solved this

Cloud providers and hosting platforms solved parts of the stack, but not the full software factory loop.
### Cloud providers

AWS, GCP, Azure, and DigitalOcean provide raw infrastructure: VMs, databases, networking, containers, storage.

They do not know how to run a specific repo.

### PaaS companies

Heroku, Render, Railway, Fly.io, and Northflank make deployment easier and may support preview environments.

But they still require the app to fit their model. A complex multi-service repo still needs project-specific orchestration.

### Frontend preview companies

Vercel, Netlify, and Cloudflare Pages are excellent for frontend preview deployments.

But many real apps need:

- backend services,
- databases,
- queues,
- workers,
- auth,
- migrations,
- internal APIs,
- service registration,
- seed data,
- browser automation.

So they solve an important slice, but not the full "agent can run the whole app and debug it" problem.

### Dev environment companies

GitHub Codespaces, Gitpod, Coder, DevPod, Daytona, and Dev Containers provide cloud/dev environments.

They provide the **box**.

They do not automatically know the app lifecycle:

- start this service first,
- wait for this healthcheck,
- allocate these ports,
- register this service,
- expose this URL,
- run this validation,
- record this video,
- open this PR with evidence.

That knowledge still needs to live in a project-specific runtime contract.

## The Diffie example

Diffie already demonstrates the problem and the solution direction.

The current `dev.sh` is not just a convenience script. It is an encoded development environment contract.
It handles:

- loading `.env.dev`,
- applying worktree-specific overrides,
- setting `DATABASE_URL` based on the worktree Postgres port,
- setting `ELECTRIC_URL` based on the worktree Electric port,
- starting Docker Desktop on macOS,
- assuming systemd-managed Docker on Linux,
- starting Postgres and Electric SQL via Docker Compose,
- supporting unique Docker Compose project names for worktree isolation,
- waiting for Postgres readiness,
- installing dependencies with `bun install`,
- pushing schema with `bun run dev:push`,
- allocating dynamic ports from a configurable range,
- writing dynamic `.dev.vars` for realtime service auth,
- deriving Restate fabric/admin/ingress ports,
- starting Restate,
- waiting for Restate health,
- building and starting the test service,
- registering the test service with Restate,
- stopping temporary registration processes,
- handing off long-lived services to `concurrently`,
- starting auth, restate, api, test service, realtime, explore, app, and website,
- exporting all required env vars to child processes,
- printing app, API, explore, realtime, Electric, Restate, website, and tailnet URLs.

This is exactly the kind of complexity that a coding agent needs handled before it can effectively develop a feature.

A generic sandbox does not know any of this. A sandbox gives the agent a place to run. Diffie's `dev.sh` tells the agent **how to make Diffie runnable**.

## Would a sandbox remove the need for `dev.sh`?

No.

A sandbox gives isolation. It does not give understanding.

The sandbox can provide:

- a clean machine,
- filesystem,
- CPU/RAM,
- Docker,
- browser,
- network,
- maybe secrets,
- maybe snapshotting,
- maybe public preview URL.

But it does not know:

- which services to start,
- which dependency manager to use,
- which database is needed,
- which migrations to run,
- which ports matter,
- which frontend URL is the real app,
- which background worker must run,
- how to seed data,
- how to run tests,
- how to know the app is ready,
- how to expose the preview,
- how to clean up.

So even with one sandbox per PR, we still need something like `dev.sh`.

But the role changes. `dev.sh` becomes less of a laptop startup script and more of a machine-readable runtime contract.

The evolution should be:

```text
local dev.sh -> runtime contract -> sandbox runtime
```

## The ideal factory interface

Each repo should expose standard commands like:

```bash
factory setup
factory start
factory status
factory test
factory validate
factory urls
factory stop
factory clean
```

Or:

```bash
./factory.sh setup
./factory.sh start --port-range 46200 --worktree-id job_123
./factory.sh status --json
./factory.sh urls --json
./factory.sh test
./factory.sh validate --feature "user can create project"
./factory.sh stop
```

This gives AI agents a stable contract. They do not need to understand every internal detail of Diffie. They need reliable commands, healthchecks, URLs, logs, and artifacts.
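
One way an orchestrator might consume this contract is to poll `./factory.sh status --json` until every service reports healthy. The sketch below is illustrative only: the JSON shape `{"services": {"<name>": {"healthy": true}}}` is an assumption (the note does not define the output format), and `fetch_status`, `all_healthy`, and `wait_until_ready` are hypothetical helper names.

```python
import json
import subprocess
import time
from typing import Callable


def fetch_status(repo_dir: str) -> dict:
    """Run `./factory.sh status --json` in repo_dir and parse stdout.

    Assumes the script prints a single JSON object (hypothetical format).
    """
    result = subprocess.run(
        ["./factory.sh", "status", "--json"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def all_healthy(status: dict) -> bool:
    """True only if at least one service is listed and all report healthy."""
    services = status.get("services", {})
    return bool(services) and all(
        svc.get("healthy") is True for svc in services.values()
    )


def wait_until_ready(
    get_status: Callable[[], dict],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status until all services are healthy or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if all_healthy(get_status()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

In a real job this might be called as `wait_until_ready(lambda: fetch_status("/path/to/worktree"))`. Passing the status source in as a callable keeps the polling logic testable without a running environment.
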
## Possible `factory.yaml` shape

```yaml
name: diffie

runtime:
  package_manager: bun
  setup:
    - bun install
  env_file: .env.dev

isolation:
  strategy: git-worktree
  port_range_size: 100
  docker_project_prefix: diffie_factory

services:
  postgres:
    type: docker-compose
    compose_file: packages/common/docker-compose.yaml
    healthcheck: docker compose exec -T postgres pg_isready -U postgres
  electric:
    type: docker-compose
  restate:
    command: restate-server
    healthcheck: curl -sf http://localhost:${RESTATE_ADMIN_PORT}/health
  api:
    command: cd packages/api && bun run dev
    healthcheck: curl -sf http://localhost:${API_PORT}/health
  app:
    command: cd packages/app && bun run dev
    url: http://localhost:${APP_PORT}
  website:
    command: cd packages/website && bun run dev -- --hostname 0.0.0.0 --port ${WEBSITE_PORT}
    url: http://localhost:${WEBSITE_PORT}

tests:
  unit:
    - bun test
  typecheck:
    - bun run typecheck

validation:
  browser:
    start_url: http://localhost:${APP_PORT}
    record_video: true

artifacts:
  collect:
    - logs/**
    - test-results/**
    - screenshots/**
    - videos/**
```

This would be the formalized version of the current `dev.sh` knowledge.

## The architecture of the software factory

The product has three layers.

### 1. Sandbox layer

Provides isolated compute per PR/job.

Possible backends:

- local machine with git worktrees,
- Docker containers,
- Firecracker/microVMs,
- Modal,
- E2B,
- Daytona,
- Codespaces,
- devcontainers,
- Kubernetes jobs,
- remote CPU/GPU workers if needed.

### 2. Runtime contract layer

Project-specific instructions for making the app runnable.

Includes:

- setup,
- start,
- healthchecks,
- URLs,
- tests,
- validation,
- cleanup,
- artifact collection,
- logs.

This is the evolved `dev.sh`.

### 3. Agent workflow layer

Coordinates:

- Telegram intake,
- planning,
- coding,
- running tests,
- browser testing,
- video capture,
- PR creation,
- review bot,
- Telegram updates.

The durable product may be layer 2 + layer 3, with the sandbox provider being pluggable.
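
To make "pluggable sandbox provider" concrete, here is a minimal sketch of what the boundary between layer 1 and the upper layers might look like. Everything here is hypothetical: `Sandbox`, `SandboxProvider`, `TempDirProvider`, and `run_in_sandbox` are illustrative names, and the temp-directory backend is only a stand-in that offers no real isolation. A git-worktree, Docker, or microVM backend would implement the same two methods.

```python
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Protocol


@dataclass(frozen=True)
class Sandbox:
    """Handle to one isolated workspace for a single job."""
    job_id: str
    workdir: Path


class SandboxProvider(Protocol):
    """Interface the workflow layer depends on (hypothetical)."""

    def create(self, job_id: str) -> Sandbox: ...

    def destroy(self, sandbox: Sandbox) -> None: ...


class TempDirProvider:
    """Stand-in backend: one temporary directory per job, no real isolation."""

    def create(self, job_id: str) -> Sandbox:
        workdir = Path(tempfile.mkdtemp(prefix=f"factory_{job_id}_"))
        return Sandbox(job_id=job_id, workdir=workdir)

    def destroy(self, sandbox: Sandbox) -> None:
        shutil.rmtree(sandbox.workdir, ignore_errors=True)


def run_in_sandbox(
    provider: SandboxProvider,
    job_id: str,
    work: Callable[[Sandbox], str],
) -> str:
    """Create a sandbox, run the job's work in it, and always clean up."""
    sandbox = provider.create(job_id)
    try:
        return work(sandbox)
    finally:
        provider.destroy(sandbox)
```

The `try`/`finally` reflects the cleanup guardrail: the workspace is destroyed even if the job raises. The workflow layer only sees `SandboxProvider`, so swapping backends should not require changes to layers 2 and 3.
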
## State machine for a software factory job

A job should move through explicit states:

- `received`
- `planned`
- `worktree_created`
- `sandbox_created`
- `dev_environment_starting`
- `dev_environment_ready`
- `implementation_in_progress`
- `tests_running`
- `browser_validation_running`
- `video_recorded`
- `pr_opened`
- `review_running`
- `ready_for_human_review`
- `merged`
- `failed`
- `cancelled`

Each state should record:

- timestamps,
- logs,
- responsible agent,
- retry policy,
- artifact links,
- Telegram status updates.

## Roles inside the factory

### Planner

Reads the Telegram request, repo context, docs, issues, and existing code. Produces a scoped implementation plan and acceptance criteria.

### Builder

Implements the change inside the isolated worktree/sandbox.

### Environment keeper

Starts and monitors dev processes, checks health, exposes preview URLs, and cleans up resources.

### Tester / recorder

Uses browser or mobile automation to exercise the feature and record a video or screenshots.

### Reviewer

Performs first-pass code review, flags risks, checks style/security/regressions, and comments on the PR.

### Notifier

Keeps Telegram updated with progress, failures, PR links, preview links, and final artifacts.

## GitHub PR contract

Every factory-generated PR should include:

- original Telegram request link/message excerpt,
- problem statement,
- implementation summary,
- files changed / architecture notes,
- acceptance criteria checklist,
- test results,
- preview link,
- video artifact link,
- reviewer bot summary,
- known risks,
- rollback notes.

Labels could include:

- `factory-generated`,
- `needs-human-review`,
- project label,
- risk level label.
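
As a rough sketch of how the PR contract above could be rendered into a PR description, the snippet below builds a Markdown body from collected evidence. It covers only a subset of the listed fields, and `PrEvidence` and `render_pr_body` are hypothetical names. The section titles and the default rollback text are assumptions, not an agreed template.

```python
from dataclasses import dataclass, field


@dataclass
class PrEvidence:
    """Evidence gathered by a factory job (illustrative subset of the contract)."""
    request_excerpt: str
    summary: str
    acceptance_criteria: dict  # criterion text -> whether validation passed
    test_results: str
    preview_url: str
    video_url: str
    reviewer_summary: str
    known_risks: list = field(default_factory=list)
    rollback_notes: str = "Revert this PR."


def render_pr_body(e: PrEvidence) -> str:
    """Render the evidence as a Markdown PR description."""
    checklist = [
        f"- [{'x' if ok else ' '}] {text}"
        for text, ok in e.acceptance_criteria.items()
    ]
    risks = [f"- {risk}" for risk in e.known_risks] or ["- None identified"]
    sections = [
        ("Original request", [f"> {e.request_excerpt}"]),
        ("Implementation summary", [e.summary]),
        ("Acceptance criteria", checklist),
        ("Test results", [e.test_results]),
        ("Preview", [e.preview_url]),
        ("Video", [e.video_url]),
        ("Reviewer bot summary", [e.reviewer_summary]),
        ("Known risks", risks),
        ("Rollback notes", [e.rollback_notes]),
    ]
    lines = []
    for title, body in sections:
        lines.append(f"## {title}")
        lines.extend(body)
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"
```

Keeping the evidence in a typed record, separate from rendering, would let the same data feed both the PR body and the Telegram summary.
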
## Safety and controls

The factory needs guardrails:

- explicit allowlist mapping Telegram chats to repos,
- no arbitrary chat can trigger arbitrary code execution,
- secrets managed through existing secret stores,
- no secrets pasted into PRs or Telegram,
- concurrency limits per repo and per machine,
- cancellation command from Telegram,
- automatic cleanup of old worktrees/processes/containers,
- port/process leak detection,
- no force-pushing protected branches,
- human approval for destructive migrations, production deploys, billing changes, or customer-impacting changes.

## MVP path

Start with one pilot repo: Diffie.

1. Formalize Diffie's `dev.sh` and `scripts/worktree.sh` behavior into a runtime contract.
2. Define `software-factory.yaml` or `factory.yaml`.
3. Create Telegram intake for one project chat.
4. Implement job DB/state tracking.
5. Implement worktree creation and dynamic port allocation.
6. Start the Diffie dev environment in an isolated worktree/sandbox.
7. Capture health status and preview URLs.
8. Run a simple coding-agent task on a tiny UI/API change.
9. Open a GitHub PR automatically.
10. Add browser validation and screen recording.
11. Add reviewer bot pass.
12. Post final summary back to Telegram.

## Initial acceptance criteria

The MVP works when:

- A Telegram message in the project chat creates a GitHub PR without manual CLI intervention.
- The PR includes a working preview link for that branch/worktree/sandbox.
- The PR includes a video or screenshot artifact proving the validation agent opened and tested the feature.
- A second review bot leaves a review/check on the PR.
- Multiple jobs can run concurrently without port, Docker project, database, or branch collisions.
- Failed jobs report the failing stage, logs, and cleanup status back to Telegram.

## Competitive context

Existing AI coding agents include:

- Cursor,
- Claude Code,
- Codex,
- Devin,
- GitHub Copilot agent/workspace,
- Jules,
- OpenCode,
- Windsurf-style agents.

If the product is "send a prompt, get code," it is too crowded.

The sharper wedge is:

> Every AI-generated PR comes with a live preview, proof-of-test video, automated code review, and a reproducible isolated dev environment.

Possible positioning:

- AI-native Internal Developer Platform.
- Agent Runtime Platform for Software Engineering.
- Per-PR Development Sandbox Platform.
- Trust Layer for AI-generated Pull Requests.
- Software Factory Orchestrator.

Best crisp line:

> Your AI coding agents are useless unless they can run the app. We give every agent a working dev environment, browser, preview URL, and PR validation loop.

## Who might buy it?

Likely buyers/users:

- CTOs at AI-forward startups,
- heads of engineering,
- platform engineering leads,
- DevEx leads,
- engineering productivity teams,
- agencies building many web/mobile features,
- companies adopting Codex/Claude Code/Devin/Cursor agents,
- teams with complex local dev setup,
- teams with many PRs and slow review cycles.

This is most valuable for projects with:

- multiple services,
- nontrivial local setup,
- web/mobile UI,
- recurring feature requests,
- many small-to-medium product changes,
- many repos,
- desire to use AI agents in parallel,
- need for reviewability and auditability.

It is less valuable for:

- tiny scripts,
- greenfield prototypes,
- static sites,
- solo developers comfortable with local setup,
- teams whose CI/CD preview system already handles everything well.

## Key insight to remember

Pre-AI, humans absorbed missing context. They could ask a teammate, read docs, infer a missing step, restart a service, or manually inspect the browser.

Post-AI, the environment must be explicit enough for a machine.

AI turns an informal human workflow into a required machine-readable protocol.

The software factory should be the protocol and execution layer for:

```text
request -> branch -> sandbox -> running app -> feedback -> test -> video -> PR -> review
```

## Open questions

- Should the factory be a Hermes webhook/cron workflow, a dedicated long-running service, or a hybrid?
- Where should job state live: project-local SQLite, Anand's personal SQLite, or a hosted DB?
- Which agent should be the default builder: Codex CLI, Claude Code, OpenCode, or a router?
- What is the first preview exposure mechanism: tailnet, local tunnel, Vercel/Cloudflare, or project-specific dev URL?
- How strict should human approval gates be before PR creation vs before merge/deploy?
- Should this begin as internal tooling for Diffie or as a SaaS offering after the Diffie pilot?
- Is the core product the sandbox provider, the runtime contract, or the orchestrator that coordinates existing tools?

## Next concrete step

Write the Diffie pilot implementation plan:

1. Define `software-factory.yaml`.
2. Document the worktree/dev environment contract around `dev.sh`.
3. Design the job-state schema.
4. Define Telegram commands and status messages.
5. Define the PR template.
6. Choose browser/video recording mechanism.
7. Create a first end-to-end demo task that changes something small and opens a PR.

Related todo in the personal task DB: **Build software factory agent for mobile feature requests and PR previews**.
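
As a possible starting point for the job-state schema (step 3 above), the states listed under "State machine for a software factory job" could be encoded with an explicit transition check. This sketch makes two assumptions the note does not specify: the happy path follows the states in the order listed, and any non-terminal state may move to `failed` or `cancelled`. `can_transition` and `Job` are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed linear happy path, in the order the states are listed in the note.
HAPPY_PATH = [
    "received", "planned", "worktree_created", "sandbox_created",
    "dev_environment_starting", "dev_environment_ready",
    "implementation_in_progress", "tests_running",
    "browser_validation_running", "video_recorded", "pr_opened",
    "review_running", "ready_for_human_review", "merged",
]
TERMINAL = {"merged", "failed", "cancelled"}


def can_transition(current: str, target: str) -> bool:
    """Allow the next happy-path step, or failed/cancelled from any live state."""
    if current in TERMINAL:
        return False
    if target in ("failed", "cancelled"):
        return True
    if current not in HAPPY_PATH or target not in HAPPY_PATH:
        return False
    return HAPPY_PATH.index(target) == HAPPY_PATH.index(current) + 1


@dataclass
class Job:
    job_id: str
    state: str = "received"
    # Each entry: (from_state, to_state, UTC timestamp of the transition).
    history: list = field(default_factory=list)

    def advance(self, target: str) -> None:
        """Move to target, recording the transition, or raise if illegal."""
        if not can_transition(self.state, target):
            raise ValueError(f"illegal transition: {self.state} -> {target}")
        self.history.append((self.state, target, datetime.now(timezone.utc)))
        self.state = target
```

A persisted version would presumably also carry the logs, responsible agent, retry policy, and artifact links the note asks each state to record. Those are omitted here to keep the transition logic visible.
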