# Software Factory - AI Agent Runtime and PR Sandbox
Backlink: [[2026-05-20]]
## One-line thesis
The world probably does not need another generic AI coding agent. It needs a **factory layer around coding agents** that gives every AI-written change a real development environment, browser feedback loop, preview URL, test evidence, video proof, code review, and a clean GitHub PR.
In other words:
> AI coding agents create demand for a new kind of platform engineering: environments and feedback loops built for machines, not just humans.
>
> "Given a fresh machine and this repo, how do I make the app runnable and testable?"
## The product vision
A **software factory** is a system where each project has a Telegram chat. A user can send an idea or feature request into that chat, and the system automatically:
1. Understands the request.
2. Creates an isolated branch/worktree or sandbox.
3. Starts the project-specific dev environment.
4. Lets an AI coding agent implement the change.
5. Gives the agent a browser/app runtime so it can test and debug what it wrote.
6. Runs automated tests.
7. Records a video or screenshots proving the feature was exercised.
8. Opens a GitHub PR.
9. Adds a preview link, video artifact, test summary, and implementation summary to the PR.
10. Triggers a second review bot to review the code.
11. Posts the final PR, preview link, test evidence, and reviewer verdict back into Telegram.
The desired human experience:
> I type the idea into the project chat. The factory returns a reviewable PR with proof that the app actually ran.
## Why this matters now
AI can increasingly produce plausible code. But plausible code is not the same as working software.
The real bottleneck is shifting from:
> Can AI write code?
To:
> Can AI run the app, observe failure, fix its own mistakes, produce proof, and hand humans a trustworthy PR?
Developers are already using AI heavily, but they do not fully trust it. Stack Overflow's 2025 Developer Survey reported that:
- 84% of respondents use or plan to use AI tools in development.
- 51% of professional developers use AI tools daily.
- Only about 33% trust AI accuracy, while about 46% distrust it.
- The biggest frustration, cited by 66%, is AI solutions that are "almost right, but not quite."
- 87% of agent users are concerned about AI-agent accuracy.
- 81% are concerned about security and privacy.
Adoption is high but trust is low. That gap between usage and confidence is the trust gap this product targets.
The product should not be framed as "AI writes code." That part is becoming commoditized. It should be framed as:
> AI PRs you can trust: every request becomes a PR with a live preview, test evidence, video validation, and automated review.
## The deeper pain point
The obvious pain point is that humans want to see the output of AI-written code.
The deeper pain point is that the AI itself needs an executable feedback loop while it is coding.
A good coding agent needs to:
1. Edit code.
2. Install dependencies.
3. Build the app.
4. Start the required services.
5. Run tests.
6. Open the browser or app.
7. Try the feature.
8. Observe errors and logs.
9. Fix the code.
10. Repeat until the feature works.
Without a real running environment, the agent is coding in the dark. This is one reason AI code often feels "almost right, but not quite." The model can generate a plausible diff without truly exercising the product.
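The edit/run/observe/fix cycle above can be sketched as a bounded loop. This is a minimal illustration, not an actual agent implementation: `CheckResult`, `feedback_loop`, and the `edit`/`check` callables are hypothetical names, and `check` stands in for whatever builds, starts, and exercises the app.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool
    log: str  # build/test/browser output the agent can read on failure


def feedback_loop(
    edit: Callable[[str], None],
    check: Callable[[], CheckResult],
    max_attempts: int = 5,
) -> tuple[bool, int]:
    """Alternate edit and check until the check passes or attempts run out.

    `edit` receives the previous failure log (empty on the first attempt).
    `check` runs the app and reports whether the feature worked.
    Returns (succeeded, attempts_used).
    """
    last_log = ""
    for attempt in range(1, max_attempts + 1):
        edit(last_log)
        result = check()
        if result.ok:
            return True, attempt
        last_log = result.log  # feed the observed failure into the next edit
    return False, max_attempts
```

The attempt cap matters: without it, an agent that cannot fix the failure would loop forever instead of reporting a failed job.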
So the product is not merely "Telegram to PR." It is:
> A runtime feedback environment for coding agents.
Or more specifically:
> Per-PR development sandboxes for AI agents, with live app execution, browser feedback, video validation, and PR packaging.
## Did this problem exist before AI?
Yes. This problem existed before AI in several forms.
### 1. Local development setup
Teams have always had the problem:
> It works on my machine.
Or:
> The new engineer took three days to get the repo running.
This involved:
- installing dependencies,
- setting up databases,
- configuring env vars,
- running migrations,
- starting background workers,
- keeping local dev close enough to production,
- testing frontend and backend together,
- connecting to internal services,
- and keeping setup docs current.
A script like `dev.sh` is the classic response: encode tribal knowledge into a repeatable command.
### 2. Staging environments
Traditional teams often used shared staging environments:
```text
dev -> staging -> production
```
But shared staging has limitations:
- Developers block each other.
- One broken branch can break staging.
- Test data becomes messy.
- Staging may not match the branch being reviewed.
- It is hard to test many PRs concurrently.
### 3. Preview environments / review apps
Pre-AI, many companies introduced one environment per PR or branch.
Examples include:
- Heroku Review Apps,
- Vercel Preview Deployments,
- Netlify Deploy Previews,
- Render preview environments,
- Railway environments,
- Fly.io apps,
- Cloudflare Pages previews,
- Kubernetes ephemeral environments.
The idea was:
> Every branch or PR should get a running app URL.
This is close to the software factory idea, but it was designed mainly for human developers and reviewers, not autonomous coding agents.
### 4. CI/CD
CI/CD systems already handled parts of the workflow:
- building,
- linting,
- testing,
- deploying,
- artifact collection,
- status checks.
Examples:
- Jenkins,
- GitHub Actions,
- GitLab CI,
- CircleCI,
- Buildkite,
- Harness,
- Travis CI,
- ArgoCD,
- Spinnaker.
But CI/CD usually runs **after** code is written and pushed. AI agents need feedback **while** writing the code. That is the key difference.
### 5. Internal Developer Platforms
Larger companies built internal platforms so developers could self-serve:
- create services,
- create environments,
- deploy branches,
- run tests,
- access logs,
- get databases,
- get secrets,
- monitor apps.
This became the world of:
- Internal Developer Platforms,
- Platform Engineering,
- Developer Experience,
- Developer Productivity Engineering.
Examples/tools in this broader space include:
- Backstage,
- Humanitec,
- Port,
- Cortex,
- OpsLevel,
- Qovery,
- Okteto,
- Northflank,
- Coder,
- Gitpod,
- GitHub Codespaces,
- Terraform Cloud,
- Kubernetes-based internal platforms.
So the problem is not new. What is new is that AI agents need these systems to be explicit, machine-readable, deterministic, and available per task.
## Who owned this problem pre-AI?
Several roles owned pieces of this problem.
### DevOps engineers
They handled:
- CI/CD pipelines,
- deployment scripts,
- Dockerfiles,
- Kubernetes manifests,
- infrastructure automation,
- release processes,
- monitoring,
- production reliability.
The software factory overlaps with DevOps, but focuses more on development-time automation and agent feedback loops, not only production deployment.
### Platform engineers
This is probably the closest existing role.
Platform engineers build internal systems so developers can self-serve. They own:
- internal developer platforms,
- service templates,
- deployment workflows,
- preview environments,
- secrets access,
- Kubernetes abstractions,
- dev portals,
- golden paths,
- environment provisioning,
- CI/CD templates.
A software factory augments platform engineers by turning the platform into something AI agents can use.
### DevEx / Developer Experience engineers
They focus on making developers faster and less frustrated:
- local dev setup,
- onboarding,
- build speed,
- test speed,
- documentation,
- CLIs,
- dev containers,
- preview environments,
- internal dashboards.
The software factory is essentially **DevEx for humans and AI agents together**.
### Developer Productivity engineers
They work on:
- build systems,
- test systems,
- monorepo tooling,
- CI performance,
- code generation,
- local development tools,
- dependency management.
The software factory would make their tooling usable by autonomous agents.
### Release engineers
They own:
- build artifacts,
- versioning,
- release branches,
- release notes,
- promotion between environments,
- rollback processes.
The software factory touches release engineering, but at an earlier stage: PR creation and validation.
### QA automation engineers
They own:
- E2E tests,
- Selenium/Cypress/Playwright,
- regression suites,
- test plans,
- screenshots/videos,
- test environments.
The software factory brings a first-pass QA loop into every AI-generated PR.
### SREs
They focus on production reliability:
- observability,
- incidents,
- SLOs,
- production automation,
- capacity,
- reliability.
The software factory may reduce production risk by catching bad changes earlier, but it does not replace SRE.
### Senior engineers / tech leads
In smaller teams, the senior engineer or tech lead often writes:
- `dev.sh`,
- `Makefile`,
- Docker Compose setup,
- onboarding docs,
- CI workflow,
- seed scripts,
- deployment scripts.
This is exactly the pattern at Diffie, where `dev.sh` carries that knowledge.
## What does this augment or replace?
The software factory mostly **augments**:
- platform engineers,
- DevOps engineers,
- DevEx teams,
- developer productivity teams,
- QA automation engineers,
- senior engineers and tech leads.
It may partially replace:
- manual PR preparation,
- manual smoke testing,
- some basic QA passes,
- custom glue scripts,
- repetitive environment setup,
- repetitive implementation work for small tickets.
It probably does **not** replace:
- strong platform engineering,
- senior architecture judgment,
- product engineering judgment,
- SRE,
- serious human code review,
- security/governance ownership.
The human role shifts from implementation operator to orchestrator, reviewer, product thinker, and platform designer.
## Why server infrastructure companies only partially solved this
Cloud providers and hosting platforms solved parts of the stack, but not the full software factory loop.
### Cloud providers
AWS, GCP, Azure, and DigitalOcean provide raw infrastructure: VMs, databases, networking, containers, storage.
They do not know how to run a specific repo.
### PaaS companies
Heroku, Render, Railway, Fly.io, and Northflank make deployment easier and may support preview environments.
But they still require the app to fit their model. A complex multi-service repo still needs project-specific orchestration.
### Frontend preview companies
Vercel, Netlify, and Cloudflare Pages are excellent for frontend preview deployments.
But many real apps need:
- backend services,
- databases,
- queues,
- workers,
- auth,
- migrations,
- internal APIs,
- service registration,
- seed data,
- browser automation.
So they solve an important slice, but not the full "agent can run the whole app and debug it" problem.
### Dev environment companies
GitHub Codespaces, Gitpod, Coder, DevPod, Daytona, and Dev Containers provide cloud/dev environments.
They provide the **box**. They do not automatically know the app lifecycle:
- start this service first,
- wait for this healthcheck,
- allocate these ports,
- register this service,
- expose this URL,
- run this validation,
- record this video,
- open this PR with evidence.
That knowledge still needs to live in a project-specific runtime contract.
## The Diffie example
Diffie already demonstrates the problem and the solution direction.
The current `dev.sh` is not just a convenience script. It is an encoded development environment contract.
It handles:
- loading `.env.dev`,
- applying worktree-specific overrides,
- setting `DATABASE_URL` based on the worktree Postgres port,
- setting `ELECTRIC_URL` based on the worktree Electric port,
- starting Docker Desktop on macOS,
- assuming systemd-managed Docker on Linux,
- starting Postgres and Electric SQL via Docker Compose,
- supporting unique Docker Compose project names for worktree isolation,
- waiting for Postgres readiness,
- installing dependencies with `bun install`,
- pushing schema with `bun run dev:push`,
- allocating dynamic ports from a configurable range,
- writing dynamic `.dev.vars` for realtime service auth,
- deriving Restate fabric/admin/ingress ports,
- starting Restate,
- waiting for Restate health,
- building and starting the test service,
- registering the test service with Restate,
- stopping temporary registration processes,
- handing off long-lived services to `concurrently`,
- starting auth, restate, api, test service, realtime, explore, app, and website,
- exporting all required env vars to child processes,
- printing app, API, explore, realtime, Electric, Restate, website, and tailnet URLs.
This is exactly the kind of complexity that a coding agent needs handled before it can effectively develop a feature.
A generic sandbox does not know any of this.
A sandbox gives the agent a place to run. Diffie's `dev.sh` tells the agent **how to make Diffie runnable**.
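Two of the behaviors listed above, allocating ports per worktree and waiting for a service to become ready, can be sketched in a few lines. This is not Diffie's actual `dev.sh` logic. The function names, the base port, and the block size are assumptions chosen for illustration.

```python
import socket
import time
import zlib


def port_block(worktree_id: str, base: int = 46000, block_size: int = 100, slots: int = 50) -> range:
    """Map a worktree id to a stable block of ports (hypothetical scheme).

    Two ids can hash to the same slot, so callers should still probe
    individual ports with is_free() before using them.
    """
    slot = zlib.crc32(worktree_id.encode("utf-8")) % slots
    start = base + slot * block_size
    return range(start, start + block_size)


def is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if the port can currently be bound on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until something accepts TCP connections on the port, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A TCP connect only shows that a process is listening. A real readiness check would usually go further, for example `pg_isready` for Postgres or an HTTP health endpoint.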
## Would a sandbox remove the need for `dev.sh`?
No.
A sandbox gives isolation. It does not give understanding.
The sandbox can provide:
- a clean machine,
- filesystem,
- CPU/RAM,
- Docker,
- browser,
- network,
- maybe secrets,
- maybe snapshotting,
- maybe public preview URL.
But it does not know:
- which services to start,
- which dependency manager to use,
- which database is needed,
- which migrations to run,
- which ports matter,
- which frontend URL is the real app,
- which background worker must run,
- how to seed data,
- how to run tests,
- how to know the app is ready,
- how to expose the preview,
- how to clean up.
So even with one sandbox per PR, we still need something like `dev.sh`.
But the role changes. `dev.sh` becomes less of a laptop startup script and more of a machine-readable runtime contract.
The evolution should be:
```text
local dev.sh -> runtime contract -> sandbox runtime
```
## The ideal factory interface
Each repo should expose standard commands like:
```bash
factory setup
factory start
factory status
factory test
factory validate
factory urls
factory stop
factory clean
```
Or:
```bash
./factory.sh setup
./factory.sh start --port-range 46200-46299 --worktree-id job_123
./factory.sh status --json
./factory.sh urls --json
./factory.sh test
./factory.sh validate --feature "user can create project"
./factory.sh stop
```
This gives AI agents a stable contract. They do not need to understand every internal detail of Diffie. They need reliable commands, healthchecks, URLs, logs, and artifacts.
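From the agent's side, consuming this contract could look like the sketch below. The `factory.sh` script does not exist yet, and the JSON shape returned by `status --json` (a `services` map with a `healthy` flag per service) is purely an assumption for illustration.

```python
import json
import subprocess


def factory_cmd(action: str, *args: str, script: str = "./factory.sh") -> list[str]:
    """Build the argv for one factory command, e.g. ['./factory.sh', 'status', '--json']."""
    return [script, action, *args]


def run_factory(action: str, *args: str, cwd: str = ".") -> str:
    """Run a factory command in the worktree and return its stdout.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    done = subprocess.run(
        factory_cmd(action, *args),
        cwd=cwd,
        check=True,
        capture_output=True,
        text=True,
    )
    return done.stdout


def unhealthy_services(status_json: str) -> list[str]:
    """List services not reported healthy, given the assumed status JSON shape."""
    status = json.loads(status_json)
    return sorted(
        name
        for name, service in status["services"].items()
        if not service.get("healthy", False)
    )
```

An agent would call `run_factory("status", "--json")` and pass the result to `unhealthy_services` before attempting browser validation.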
## Possible `factory.yaml` shape
```yaml
name: diffie
runtime:
  package_manager: bun
  setup:
    - bun install
  env_file: .env.dev
isolation:
  strategy: git-worktree
  port_range_size: 100
  docker_project_prefix: diffie_factory
services:
  postgres:
    type: docker-compose
    compose_file: packages/common/docker-compose.yaml
    healthcheck: docker compose exec -T postgres pg_isready -U postgres
  electric:
    type: docker-compose
  restate:
    command: restate-server
    healthcheck: curl -sf http://localhost:${RESTATE_ADMIN_PORT}/health
  api:
    command: cd packages/api && bun run dev
    healthcheck: curl -sf http://localhost:${API_PORT}/health
  app:
    command: cd packages/app && bun run dev
    url: http://localhost:${APP_PORT}
  website:
    command: cd packages/website && bun run dev -- --hostname 0.0.0.0 --port ${WEBSITE_PORT}
    url: http://localhost:${WEBSITE_PORT}
tests:
  unit:
    - bun test
  typecheck:
    - bun run typecheck
validation:
  browser:
    start_url: http://localhost:${APP_PORT}
    record_video: true
artifacts:
  collect:
    - logs/**
    - test-results/**
    - screenshots/**
    - videos/**
```
This would be the formalized version of the current `dev.sh` knowledge.
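The `${API_PORT}`-style placeholders in the config need to be filled in with the ports allocated for a given job. One simple way to do that, assuming the factory resolves them itself rather than relying on the shell, is Python's standard `string.Template`. The `expand` helper name is hypothetical.

```python
from string import Template


def expand(command: str, ports: dict[str, int]) -> str:
    """Fill ${NAME} placeholders in a command or URL with allocated ports.

    Raises KeyError if the text references a placeholder that was not allocated,
    which surfaces config mistakes before any service is started.
    """
    return Template(command).substitute(ports)
```

For example, `expand("curl -sf http://localhost:${API_PORT}/health", {"API_PORT": 46201})` yields `curl -sf http://localhost:46201/health`. One caveat of this approach: any other `$` in a command is also treated as a placeholder, so a literal dollar sign would have to be written as `$$`.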
## The architecture of the software factory
The product has three layers.
### 1. Sandbox layer
Provides isolated compute per PR/job.
Possible backends:
- local machine with git worktrees,
- Docker containers,
- Firecracker/microVMs,
- Modal,
- E2B,
- Daytona,
- Codespaces,
- devcontainers,
- Kubernetes jobs,
- remote CPU/GPU workers if needed.
### 2. Runtime contract layer
Project-specific instructions for making the app runnable.
Includes:
- setup,
- start,
- healthchecks,
- URLs,
- tests,
- validation,
- cleanup,
- artifact collection,
- logs.
This is the evolved `dev.sh`.
### 3. Agent workflow layer
Coordinates:
- Telegram intake,
- planning,
- coding,
- running tests,
- browser testing,
- video capture,
- PR creation,
- review bot,
- Telegram updates.
The durable product may be layer 2 + layer 3, with the sandbox provider being pluggable.
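A pluggable sandbox layer suggests a small interface that the upper layers depend on. The sketch below is one possible shape, not a settled design: `SandboxProvider`, its three methods, and `run_in_sandbox` are all hypothetical names.

```python
from typing import Protocol


class SandboxProvider(Protocol):
    """What layers 2 and 3 would need from any sandbox backend (assumed interface)."""

    def create(self, job_id: str, repo_path: str) -> str:
        """Provision an isolated environment and return its id."""
        ...

    def run(self, sandbox_id: str, command: list[str]) -> int:
        """Run a command inside the sandbox and return its exit code."""
        ...

    def destroy(self, sandbox_id: str) -> None:
        """Release all resources held by the sandbox."""
        ...


def run_in_sandbox(
    provider: SandboxProvider,
    job_id: str,
    repo_path: str,
    commands: list[list[str]],
) -> bool:
    """Run commands in order, stop at the first failure, and always clean up."""
    sandbox_id = provider.create(job_id, repo_path)
    try:
        return all(provider.run(sandbox_id, command) == 0 for command in commands)
    finally:
        provider.destroy(sandbox_id)
```

With this shape, a git-worktree backend, a Docker backend, and a hosted microVM backend could be swapped without touching the runtime contract or the agent workflow.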
## State machine for a software factory job
A job should move through explicit states:
- `received`
- `planned`
- `worktree_created`
- `sandbox_created`
- `dev_environment_starting`
- `dev_environment_ready`
- `implementation_in_progress`
- `tests_running`
- `browser_validation_running`
- `video_recorded`
- `pr_opened`
- `review_running`
- `ready_for_human_review`
- `merged`
- `failed`
- `cancelled`
Each state should record:
- timestamps,
- logs,
- responsible agent,
- retry policy,
- artifact links,
- Telegram status updates.
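The states above can be enforced with an explicit transition check. The sketch assumes the states are visited in the order listed, that any non-terminal state may move to `failed` or `cancelled`, and that a failed test or browser run may send the job back to implementation. Those rules are assumptions; the notes only name the states.

```python
from enum import Enum


class JobState(str, Enum):
    RECEIVED = "received"
    PLANNED = "planned"
    WORKTREE_CREATED = "worktree_created"
    SANDBOX_CREATED = "sandbox_created"
    DEV_ENVIRONMENT_STARTING = "dev_environment_starting"
    DEV_ENVIRONMENT_READY = "dev_environment_ready"
    IMPLEMENTATION_IN_PROGRESS = "implementation_in_progress"
    TESTS_RUNNING = "tests_running"
    BROWSER_VALIDATION_RUNNING = "browser_validation_running"
    VIDEO_RECORDED = "video_recorded"
    PR_OPENED = "pr_opened"
    REVIEW_RUNNING = "review_running"
    READY_FOR_HUMAN_REVIEW = "ready_for_human_review"
    MERGED = "merged"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Happy path in the order the states are listed; failed/cancelled sit outside it.
HAPPY_PATH = [s for s in JobState if s not in (JobState.FAILED, JobState.CANCELLED)]
TERMINAL = {JobState.MERGED, JobState.FAILED, JobState.CANCELLED}
# Assumed retry edges: a red test or browser run returns the job to the builder.
RETRY_TO_IMPLEMENTATION = {JobState.TESTS_RUNNING, JobState.BROWSER_VALIDATION_RUNNING}


def can_transition(current: JobState, target: JobState) -> bool:
    """Return True if a job may move from `current` to `target`."""
    if current in TERMINAL:
        return False
    if target in (JobState.FAILED, JobState.CANCELLED):
        return True
    if current in RETRY_TO_IMPLEMENTATION and target is JobState.IMPLEMENTATION_IN_PROGRESS:
        return True
    position = HAPPY_PATH.index(current)
    return position + 1 < len(HAPPY_PATH) and HAPPY_PATH[position + 1] is target
```

Rejecting illegal transitions at this level keeps the job record trustworthy even when individual agents misbehave or crash mid-step.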
## Roles inside the factory
### Planner
Reads the Telegram request, repo context, docs, issues, and existing code. Produces a scoped implementation plan and acceptance criteria.
### Builder
Implements the change inside the isolated worktree/sandbox.
### Environment keeper
Starts and monitors dev processes, checks health, exposes preview URLs, and cleans up resources.
### Tester / recorder
Uses browser or mobile automation to exercise the feature and record a video or screenshots.
### Reviewer
Performs first-pass code review, flags risks, checks style/security/regressions, and comments on the PR.
### Notifier
Keeps Telegram updated with progress, failures, PR links, preview links, and final artifacts.
## GitHub PR contract
Every factory-generated PR should include:
- original Telegram request link/message excerpt,
- problem statement,
- implementation summary,
- files changed / architecture notes,
- acceptance criteria checklist,
- test results,
- preview link,
- video artifact link,
- reviewer bot summary,
- known risks,
- rollback notes.
Labels could include:
- `factory-generated`,
- `needs-human-review`,
- project label,
- risk level label.
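The PR contract is easiest to enforce if the PR body is rendered from structured fields rather than free text. The following is a rough sketch covering most of the items above. `PrEvidence` and `render_pr_body` are hypothetical names, and the section titles are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class PrEvidence:
    request_excerpt: str
    summary: str
    acceptance_criteria: list[str]
    test_results: str
    preview_url: str
    video_url: str
    reviewer_summary: str = "pending"
    known_risks: list[str] = field(default_factory=list)


def render_pr_body(evidence: PrEvidence) -> str:
    """Render a Markdown PR description with one section per contract item."""
    criteria = "\n".join(f"- [ ] {item}" for item in evidence.acceptance_criteria)
    risks = "\n".join(f"- {risk}" for risk in evidence.known_risks) or "- none identified"
    sections = [
        ("Original request", f"> {evidence.request_excerpt}"),
        ("Implementation summary", evidence.summary),
        ("Acceptance criteria", criteria),
        ("Test results", evidence.test_results),
        ("Preview", evidence.preview_url),
        ("Video", evidence.video_url),
        ("Reviewer bot summary", evidence.reviewer_summary),
        ("Known risks", risks),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)
```

Because every field except the reviewer summary and the risk list is required, a job that produced no preview link or no video cannot silently open a PR that looks complete.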
## Safety and controls
The factory needs guardrails:
- explicit allowlist mapping Telegram chats to repos,
- no arbitrary chat can trigger arbitrary code execution,
- secrets managed through existing secret stores,
- no secrets pasted into PRs or Telegram,
- concurrency limits per repo and per machine,
- cancellation command from Telegram,
- automatic cleanup of old worktrees/processes/containers,
- port/process leak detection,
- no force-pushing protected branches,
- human approval for destructive migrations, production deploys, billing changes, or customer-impacting changes.
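The first guardrails, the chat allowlist and the per-repo concurrency limit, amount to a small admission check that runs before any code executes. A minimal in-memory sketch follows. `JobGate` is a hypothetical name, and a real version would persist counts in the job DB and handle concurrent callers.

```python
from collections import Counter
from typing import Optional


class JobGate:
    """Admit a job only if its chat is allowlisted and the repo has capacity."""

    def __init__(self, chat_to_repo: dict[int, str], max_jobs_per_repo: int = 2) -> None:
        self._chat_to_repo = dict(chat_to_repo)  # explicit allowlist, copied defensively
        self._max = max_jobs_per_repo
        self._running = Counter()  # repo name -> number of active jobs

    def admit(self, chat_id: int) -> Optional[str]:
        """Return the repo to work on, or None if the request must be rejected."""
        repo = self._chat_to_repo.get(chat_id)
        if repo is None:
            return None  # unknown chat: never triggers code execution
        if self._running[repo] >= self._max:
            return None  # repo is at its concurrency limit
        self._running[repo] += 1
        return repo

    def release(self, repo: str) -> None:
        """Mark one job on the repo as finished (completed, failed, or cancelled)."""
        if self._running[repo] > 0:
            self._running[repo] -= 1
```

Returning `None` for both rejection reasons keeps the sketch short. In practice the notifier would want to tell the user whether the chat is unmapped or the repo is simply busy.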
## MVP path
Start with one pilot repo: Diffie.
1. Formalize Diffie's `dev.sh` and `scripts/worktree.sh` behavior into a runtime contract.
2. Define `software-factory.yaml` or `factory.yaml`.
3. Create Telegram intake for one project chat.
4. Implement job DB/state tracking.
5. Implement worktree creation and dynamic port allocation.
6. Start the Diffie dev environment in an isolated worktree/sandbox.
7. Capture health status and preview URLs.
8. Run a simple coding-agent task on a tiny UI/API change.
9. Open a GitHub PR automatically.
10. Add browser validation and screen recording.
11. Add reviewer bot pass.
12. Post final summary back to Telegram.
## Initial acceptance criteria
The MVP works when:
- A Telegram message in the project chat creates a GitHub PR without manual CLI intervention.
- The PR includes a working preview link for that branch/worktree/sandbox.
- The PR includes a video or screenshot artifact proving the validation agent opened and tested the feature.
- A second review bot leaves a review/check on the PR.
- Multiple jobs can run concurrently without port, Docker project, database, or branch collisions.
- Failed jobs report the failing stage, logs, and cleanup status back to Telegram.
## Competitive context
Existing AI coding agents include:
- Cursor,
- Claude Code,
- Codex,
- Devin,
- GitHub Copilot agent/workspace,
- Jules,
- OpenCode,
- Windsurf-style agents.
If the product is "send a prompt, get code," it is too crowded.
The sharper wedge is:
> Every AI-generated PR comes with a live preview, proof-of-test video, automated code review, and a reproducible isolated dev environment.
Possible positioning:
- AI-native Internal Developer Platform.
- Agent Runtime Platform for Software Engineering.
- Per-PR Development Sandbox Platform.
- Trust Layer for AI-generated Pull Requests.
- Software Factory Orchestrator.
Best crisp line:
> Your AI coding agents are useless unless they can run the app. We give every agent a working dev environment, browser, preview URL, and PR validation loop.
## Who might buy it?
Likely buyers/users:
- CTOs at AI-forward startups,
- heads of engineering,
- platform engineering leads,
- DevEx leads,
- engineering productivity teams,
- agencies building many web/mobile features,
- companies adopting Codex/Claude Code/Devin/Cursor agents,
- teams with complex local dev setup,
- teams with many PRs and slow review cycles.
This is most valuable for projects with:
- multiple services,
- nontrivial local setup,
- web/mobile UI,
- recurring feature requests,
- many small-to-medium product changes,
- many repos,
- desire to use AI agents in parallel,
- need for reviewability and auditability.
It is less valuable for:
- tiny scripts,
- greenfield prototypes,
- static sites,
- solo developers comfortable with local setup,
- teams whose CI/CD preview system already handles everything well.
## Key insight to remember
Pre-AI, humans absorbed missing context. They could ask a teammate, read docs, infer a missing step, restart a service, or manually inspect the browser.
Post-AI, the environment must be explicit enough for a machine.
AI turns an informal human workflow into a required machine-readable protocol.
The software factory should be the protocol and execution layer for:
```text
request -> branch -> sandbox -> running app -> feedback -> test -> video -> PR -> review
```
## Open questions
- Should the factory be a Hermes webhook/cron workflow, a dedicated long-running service, or a hybrid?
- Where should job state live: project-local SQLite, Anand's personal SQLite, or a hosted DB?
- Which agent should be the default builder: Codex CLI, Claude Code, OpenCode, or a router?
- What is the first preview exposure mechanism: tailnet, local tunnel, Vercel/Cloudflare, or project-specific dev URL?
- How strict should human approval gates be before PR creation vs before merge/deploy?
- Should this begin as internal tooling for Diffie or as a SaaS offering after the Diffie pilot?
- Is the core product the sandbox provider, the runtime contract, or the orchestrator that coordinates existing tools?
## Next concrete step
Write the Diffie pilot implementation plan:
1. Define the runtime contract file (`software-factory.yaml` or `factory.yaml`).
2. Document the worktree/dev environment contract around `dev.sh`.
3. Design the job-state schema.
4. Define Telegram commands and status messages.
5. Define the PR template.
6. Choose browser/video recording mechanism.
7. Create a first end-to-end demo task that changes something small and opens a PR.
Related todo in the personal task DB: **Build software factory agent for mobile feature requests and PR previews**.