
Case Study — 01

ComfyUI Enterprise Pipeline Architecture

Serverless GPU orchestration at A100/H100 scale — VPC isolation, SOC2 Type II and HIPAA-compliant compute fabric for high-throughput visual generation.

Domain: Visual AI Infrastructure
Scale: A100 / H100 · up to 32 concurrent workers
Compliance: SOC2 Type II · HIPAA
Throughput: 2.4M images / month
p50 Latency: 1.2s (H100 · SDXL 1024×1024)



The Compute Problem at Enterprise Scale

Consumer GPU clouds are optimized for individual use. They run shared tenants on shared hardware, enforce rate limits calibrated for hobbyist throughput, and offer no contractual SLAs. An enterprise generating 40,000 inference requests per day against that infrastructure hits queue saturation by mid-morning. Cold start latency on shared infrastructure routinely lands between 8 and 45 seconds — an interval that is architecturally incompatible with production deployment.

The problem is not access to GPUs. The problem is governance, isolation, and determinism at the layer where inference happens. Enterprise clients in regulated industries — healthcare imaging, legal document processing, pharmaceutical research visualization — require a compute fabric that can be audited, isolated per tenant, and attested against compliance frameworks. ComfyUI, the most capable open-source node-based diffusion pipeline available, needed an enterprise wrapper worthy of the workloads it could serve.

This case study documents the architecture of that wrapper.


Architecture Overview

The system is composed of five discrete layers. Each layer has a single responsibility and communicates with adjacent layers through contracts, not implementation details.

Layer 1 — Inference Router. A TypeScript service running on AWS Lambda (arm64, 512MB) that receives inference requests, validates the payload schema, assigns a priority class (real-time, standard, batch), and enqueues the job. It returns a job ID and estimated completion timestamp within 80ms. No GPU is touched at this layer.

Layer 2 — Priority Queue. Amazon SQS with three FIFO queues mapped to the three priority classes. Real-time queue has a visibility timeout of 10 seconds. Standard is 45 seconds. Batch is 300 seconds. Dead-letter queues capture failures after two attempts. Job metadata — prompt hash, resolution, model version, tenant ID — is stored in DynamoDB with TTL of 72 hours.

Layer 3 — GPU Worker Pool. Containerized ComfyUI workers running on EC2 capacity reservations and supplemented by Spot instances for batch workloads. The container image is immutable per model version, built in CI, and stored in ECR. Workers pull from SQS, execute the ComfyUI workflow graph via its internal API, and write outputs directly to S3.

Layer 4 — Object Storage. Amazon S3 with per-tenant bucket isolation. S3 Block Public Access is enabled at the account level, so no bucket or object can be made public. Outputs are accessible only via presigned URLs with a 15-minute expiry, generated by the delivery service on authenticated request.

Layer 5 — Delivery and CDN. CloudFront distribution with signed cookies for authenticated sessions. Origin access control prevents direct S3 access. All cache behavior is disabled for inference outputs (no-store, no-cache) to prevent cross-tenant content leakage.
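The routing and queueing behaviour of Layers 1 and 2 can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the production router: the names `routeRequest`, `queueSettings`, and `RoutedJob` are hypothetical, the per-class completion estimates are invented for the example, and the actual SQS enqueue and DynamoDB write are omitted. The timeouts, the two-attempt dead-letter threshold, and the 72-hour TTL come from the layer descriptions above.

```typescript
import { randomUUID } from "node:crypto";

type Priority = "realtime" | "standard" | "batch";

// Visibility timeouts per priority class, as stated for Layer 2
const VISIBILITY_TIMEOUT_SECONDS: Record<Priority, number> = {
  realtime: 10,
  standard: 45,
  batch: 300,
};

// maxReceiveCount for the SQS redrive policy: after this many failed
// receives, SQS moves the message to the dead-letter queue.
const MAX_RECEIVE_COUNT = 2;

const METADATA_TTL_HOURS = 72;

// Hypothetical completion estimates per class; not given in the case study.
const ESTIMATE_SECONDS: Record<Priority, number> = {
  realtime: 2,
  standard: 30,
  batch: 600,
};

interface RouteRequest {
  tenantId: string;
  priority?: string; // caller hint, validated below
  workflow: unknown; // ComfyUI API-format graph; not inspected in this sketch
}

interface RoutedJob {
  jobId: string;
  priority: Priority;
  estimatedCompletion: string; // ISO 8601
  ttl: number; // epoch seconds, the format DynamoDB TTL expects
}

function isPriority(value: unknown): value is Priority {
  return value === "realtime" || value === "standard" || value === "batch";
}

function queueSettings(priority: Priority) {
  return {
    visibilityTimeoutSeconds: VISIBILITY_TIMEOUT_SECONDS[priority],
    maxReceiveCount: MAX_RECEIVE_COUNT,
  };
}

function routeRequest(req: RouteRequest, now: Date = new Date()): RoutedJob {
  if (typeof req.tenantId !== "string" || req.tenantId.length === 0) {
    throw new Error("tenantId is required");
  }
  if (typeof req.workflow !== "object" || req.workflow === null) {
    throw new Error("workflow must be a graph object");
  }
  // Assumption: requests without a valid hint fall into the standard class
  const priority: Priority = isPriority(req.priority) ? req.priority : "standard";
  const nowMs = now.getTime();
  return {
    jobId: randomUUID(),
    priority,
    estimatedCompletion: new Date(nowMs + ESTIMATE_SECONDS[priority] * 1000).toISOString(),
    ttl: Math.floor(nowMs / 1000) + METADATA_TTL_HOURS * 3600,
  };
}
```

Keeping the function free of I/O means the validation and classification logic can be unit-tested without touching a queue or a GPU, which matches the statement that no GPU is touched at this layer.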


GPU Scaling: A100 versus H100

Hardware selection is a decision with direct P&L consequences at this throughput. The choice between A100 and H100 is not simply about speed — it is about matching the latency envelope of the use case to the cost curve of the hardware.

The A100 80GB SXM4

The NVIDIA A100 80GB SXM4 delivers 312 TFLOPS at BF16 precision with 2 TB/s of memory bandwidth. For SDXL inference at 1024×1024 with 20 DPM++ steps, the A100 produces a completed image in approximately 2.6 to 2.9 seconds under production load, accounting for model load time from GPU memory and the ComfyUI graph traversal overhead.

The A100 is the correct choice for asynchronous batch pipelines where throughput per dollar is the optimization target. A single A100 instance can process approximately 1,200 SDXL images per hour. At 8 workers, that is 9,600 images per hour — sufficient for clients with daily volumes under 100,000 images and no hard real-time SLA requirement.

The H100 80GB SXM5

The NVIDIA H100 80GB SXM5 delivers 989 TFLOPS at BF16 precision — a 3.2× improvement over the A100 in raw compute — with 3.35 TB/s of memory bandwidth. The memory bandwidth increase is the more consequential number for diffusion inference, where the memory bus is frequently the bottleneck rather than raw FLOP count.

In production, the H100 executes the same SDXL 1024×1024 inference in 0.85 to 1.1 seconds. At eight concurrent workers, that yields approximately 28,000 images per hour. For clients with sub-2-second SLA requirements — interactive product visualization, real-time creative tooling, or on-demand document generation — the H100 is the only viable option.
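The hourly throughput figures for both GPUs follow from simple arithmetic: workers × 3600 ÷ effective seconds per image. The sketch below reproduces them. The function names are illustrative, the 3.0-second A100 value is back-derived from the stated 1,200 images per hour, and the 1.0-second H100 value is a representative point inside the stated 0.85 to 1.1 second range rather than a measured number.

```typescript
function imagesPerHour(workers: number, secondsPerImage: number): number {
  if (workers <= 0 || secondsPerImage <= 0) {
    throw new RangeError("workers and secondsPerImage must be positive");
  }
  return Math.floor((workers * 3600) / secondsPerImage);
}

// Effective seconds per image implied by an hourly per-worker figure
function impliedSecondsPerImage(imagesPerWorkerHour: number): number {
  return 3600 / imagesPerWorkerHour;
}

const a100 = imagesPerHour(8, impliedSecondsPerImage(1200)); // 9600
const h100 = imagesPerHour(8, 1.0); // 28800

console.log({ a100, h100 });
```

At 1.0 second per image the eight-worker H100 pool lands at 28,800 images per hour, consistent with the "approximately 28,000" quoted above. The A100 figure implies about 3.0 seconds per image end to end, slightly above the 2.6 to 2.9 second inference time.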

Cold Start Elimination

The most significant latency source in a serverless GPU architecture is not the GPU itself but container cold start. Loading a 6.9GB SDXL checkpoint from EBS into GPU VRAM takes 18 to 22 seconds, roughly ten times the sub-2-second SLA of the real-time class, so it cannot sit on the request path.

The solution is a minimum pool of pre-warmed workers: two always-on instances per tenant for real-time queues, held in a ready state with models pre-loaded into VRAM. Auto-scaling triggers when the SQS queue depth exceeds 10 jobs, adding two workers per 30-second interval up to a maximum of 32 concurrent workers. Workers beyond the minimum pool are scaled in after five minutes of idle time; the pre-warmed minimum itself never scales to zero.
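The scaling rule described here can be expressed as a single decision function evaluated once per 30-second interval. This is a sketch under assumptions: `desiredWorkers` and `ScalingPolicy` are hypothetical names, and idle time is tracked for the pool as a whole, so all surplus workers are released together, whereas a real controller would likely track idle time per worker.

```typescript
interface ScalingPolicy {
  minWorkers: number; // pre-warmed pool that is never released
  maxWorkers: number;
  scaleOutQueueDepth: number; // scale out when depth exceeds this value
  scaleOutStep: number; // workers added per evaluation interval
  idleSecondsBeforeScaleIn: number;
}

const POLICY: ScalingPolicy = {
  minWorkers: 2,
  maxWorkers: 32,
  scaleOutQueueDepth: 10,
  scaleOutStep: 2,
  idleSecondsBeforeScaleIn: 300,
};

// Returns the worker count the pool should converge to after this interval
function desiredWorkers(
  current: number,
  queueDepth: number,
  idleSeconds: number,
  policy: ScalingPolicy = POLICY,
): number {
  if (queueDepth > policy.scaleOutQueueDepth) {
    return Math.min(current + policy.scaleOutStep, policy.maxWorkers);
  }
  if (queueDepth === 0 && idleSeconds >= policy.idleSecondsBeforeScaleIn) {
    // Simplification: release every worker above the minimum pool at once
    return policy.minWorkers;
  }
  // Otherwise hold steady, clamped to the configured bounds
  return Math.min(Math.max(current, policy.minWorkers), policy.maxWorkers);
}
```

Starting from the two-worker minimum, a sustained backlog reaches the 32-worker ceiling after 15 intervals, or about 7.5 minutes.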

This architecture reduces p50 cold start contribution to zero. The p50 end-to-end latency — from API receipt to presigned URL delivery — is 1.2 seconds on H100 and 3.1 seconds on A100 under standard load.


VPC Isolation Architecture

Multi-tenancy in a shared compute environment is not a configuration option — it is an architectural property that must be designed in from the start. The isolation model here provides hard network boundaries between tenant workloads.

Each enterprise tenant receives a dedicated VPC in the host AWS account. No resources are shared across VPC boundaries. GPU worker instances run in private subnets with no public IP addresses. The only ingress path is from the inference router, which communicates via a VPC peering connection and a Security Group rule allowing TCP on port 8080 from the router's security group ID exclusively.

Egress is equally constrained. Workers are permitted to reach three destinations only: the tenant's S3 bucket via a VPC endpoint, the SQS FIFO queue via a VPC endpoint, and the ECR registry for image pulls via a VPC endpoint. All three use AWS PrivateLink — traffic never traverses the public internet. Explicit deny rules in the VPC Network ACL block all other egress.
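One way to keep the egress allow-list honest is a drift check that compares the VPC endpoints configured for a tenant against the expected set. The sketch below is illustrative only: `allowedEndpointServices` and `unexpectedEndpoints` are hypothetical helpers, and the region is an example value. The service names follow the AWS `com.amazonaws.<region>.<service>` convention. The text counts ECR as one destination, but ECR uses two interface endpoints (`ecr.api` and `ecr.dkr`), so the list has four entries.

```typescript
// Endpoint service suffixes for the three permitted destinations. ECR needs
// two interface endpoints: one for the registry API, one for Docker pulls.
const ALLOWED_SERVICES: readonly string[] = ["s3", "sqs", "ecr.api", "ecr.dkr"];

function allowedEndpointServices(region: string): string[] {
  return ALLOWED_SERVICES.map((service) => `com.amazonaws.${region}.${service}`);
}

// Returns any configured endpoint service that is not on the allow-list
function unexpectedEndpoints(region: string, configured: readonly string[]): string[] {
  const allowed = allowedEndpointServices(region);
  return configured.filter((name) => !allowed.includes(name));
}

const drift = unexpectedEndpoints("us-east-1", [
  "com.amazonaws.us-east-1.s3",
  "com.amazonaws.us-east-1.sqs",
  "com.amazonaws.us-east-1.ec2", // not permitted for workers
]);
console.log(drift); // [ 'com.amazonaws.us-east-1.ec2' ]
```

A check like this could run in CI against the Terraform plan or on a schedule against the live account. It complements the Network ACL deny rules and does not replace them.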

VPC Flow Logs are enabled on all subnets and streamed to CloudWatch Logs with a 90-day retention policy. Anomalous traffic patterns — unexpected egress destinations, connection attempts to blocked ports — trigger CloudWatch alarms with PagerDuty escalation.

For clients requiring multi-region redundancy, peered Transit Gateways (one per region) connect the primary VPC to a standby VPC in a secondary AWS region. Failover is automatic, DNS-based, and completes within 45 seconds under the health check configuration.


SOC2 Type II and HIPAA Compliance Architecture

Compliance is not a checkbox applied at the end of an engagement. It is a set of architectural invariants that shape every design decision from the outset.

SOC2 Type II Controls

Three Trust Service Criteria are directly implicated by an inference pipeline:

CC6.1 — Logical and Physical Access Controls. All access to GPU workers, S3 buckets, and SQS queues is governed by IAM roles with least-privilege policies. No human operator holds static access credentials to production resources. Engineers access the environment via AWS SSO with MFA enforcement and session duration limits of four hours. All role assumption events are logged to CloudTrail with immutable storage in S3 Glacier.

CC6.6 — Encryption and Key Management. All data at rest is encrypted with AES-256 using AWS KMS customer-managed keys (CMKs). Each tenant receives a dedicated CMK. Key rotation is automated at 90-day intervals. Encryption in transit enforces TLS 1.3 as the minimum; TLS 1.2 is disabled at the load balancer level. S3 bucket policies explicitly deny any request made over unencrypted transport (Deny statements conditioned on "aws:SecureTransport": "false").

CC7.2 — System Monitoring and Anomaly Detection. Amazon GuardDuty is enabled at the account level for ML-based threat detection. CloudWatch metric alarms cover GPU memory utilization (>90% for >5 minutes), SQS dead-letter queue depth (>0 messages), and S3 access pattern anomalies. All alarms route to an SNS topic with PagerDuty and Slack integrations.
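The transport condition cited under CC6.6 can be illustrated with a small policy builder. This is a sketch, not the production policy: `denyInsecureTransport` and the bucket name are hypothetical, and a real tenant policy would carry further statements. The statement shape (a Deny on all S3 actions where `aws:SecureTransport` is `"false"`) is the standard IAM pattern for rejecting non-TLS requests.

```typescript
interface PolicyStatement {
  Sid: string;
  Effect: "Allow" | "Deny";
  Principal: string;
  Action: string;
  Resource: string[];
  Condition: { Bool: { "aws:SecureTransport": "true" | "false" } };
}

interface BucketPolicy {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Builds a bucket policy that rejects every request not made over TLS
function denyInsecureTransport(bucketName: string): BucketPolicy {
  const bucketArn = `arn:aws:s3:::${bucketName}`;
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "DenyInsecureTransport",
        Effect: "Deny",
        Principal: "*",
        Action: "s3:*",
        // Both the bucket itself and every object inside it
        Resource: [bucketArn, `${bucketArn}/*`],
        Condition: { Bool: { "aws:SecureTransport": "false" } },
      },
    ],
  };
}

// "tenant-a-outputs" is a placeholder bucket name
console.log(JSON.stringify(denyInsecureTransport("tenant-a-outputs"), null, 2));
```

This condition governs transport only. Requiring server-side encryption on upload is a separate condition on the `s3:x-amz-server-side-encryption` header.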

HIPAA Technical Safeguards

Healthcare clients processing medical imaging data through the pipeline require HIPAA Business Associate Agreements and compliance with the HIPAA Security Rule's Technical Safeguard requirements (45 CFR §164.312).

The architectural guarantee for HIPAA compliance is PHI exclusion at the ingestion layer. The inference router validates that no incoming payload contains structured PHI fields — patient identifiers, dates of birth, medical record numbers — using a schema enforcement layer before the job reaches the queue. If PHI is detected, the request is rejected with a 422 status and an audit event is logged.
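A schema enforcement step of this kind might look like the following sketch. It is an assumption-laden illustration: the deny-list of field names, `findPhiFields`, and `validateNoPhi` are hypothetical, and the case study does not publish the real schema. The check matches key names only, so it catches structured PHI fields but cannot detect identifiers embedded in free text or burned into image pixels.

```typescript
// Hypothetical deny-list of structured PHI field names, compared after
// lowercasing and stripping separators. The real schema is not given here.
const PHI_FIELD_NAMES: readonly string[] = [
  "patientid",
  "patientname",
  "dateofbirth",
  "dob",
  "mrn",
  "medicalrecordnumber",
];

function normalizeKey(key: string): string {
  return key.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Walks the payload and returns the path of every key on the deny-list
function findPhiFields(value: unknown, path: string = ""): string[] {
  const hits: string[] = [];
  if (Array.isArray(value)) {
    value.forEach((item, i) => hits.push(...findPhiFields(item, `${path}[${i}]`)));
  } else if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      const here = path === "" ? key : `${path}.${key}`;
      if (PHI_FIELD_NAMES.includes(normalizeKey(key))) hits.push(here);
      hits.push(...findPhiFields(record[key], here));
    }
  }
  return hits;
}

type ValidationResult =
  | { accepted: true }
  | { accepted: false; status: 422; phiFields: string[] };

// Rejects with 422 when any PHI field is present; the caller logs the audit event
function validateNoPhi(payload: unknown): ValidationResult {
  const phiFields = findPhiFields(payload);
  return phiFields.length === 0
    ? { accepted: true }
    : { accepted: false, status: 422, phiFields };
}
```

Returning the offending paths rather than the values lets the audit event record what was rejected without writing the PHI itself into the log.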

For clients that legitimately process de-identified medical images (histopathology slides, radiology outputs stripped of DICOM metadata), the architecture maintains:

Three enterprise healthcare clients have signed BAAs against this architecture. The first SOC2 Type II audit, covering a 12-month observation period, returned zero exceptions.


Deployment Configuration

The production deployment is managed as Terraform modules with separate state backends per environment. Promotion from staging to production requires a passing GitHub Actions pipeline — unit tests on the inference router, integration tests against a mock ComfyUI instance, and a Checkov scan of the Terraform plan for security misconfigurations.

```typescript
// Inference job schema, TypeScript (production)

// Placeholder for the ComfyUI API-format graph (node ID -> node definition);
// the exact shape is not given in this case study.
type ComfyWorkflow = Record<string, unknown>;

interface InferenceJob {
  jobId:      string;           // UUIDv4
  tenantId:   string;           // Opaque tenant identifier
  priority:   'realtime' | 'standard' | 'batch';
  workflow:   ComfyWorkflow;    // ComfyUI API-format graph
  resolution: [number, number]; // Width, height in pixels
  seed:       number;           // Deterministic seed for reproducibility
  model:      string;           // ECR image tag for worker container
  createdAt:  string;           // ISO 8601
  ttl:        number;           // Unix timestamp in seconds, 72h from createdAt
}
```

Deployments use a blue/green strategy with zero-downtime cutover. The active worker pool (blue) continues processing jobs while the new pool (green) passes health checks. Traffic shifts at the SQS consumer level — green workers begin pulling from queues; blue workers drain their current jobs and stop. Total cutover time: under 90 seconds.


Measured Outcomes

Across three enterprise clients operating this architecture in production over 24 months:

The architecture is purpose-built for workloads that cannot tolerate ambiguity. Every design decision — from VPC topology to key rotation intervals to cold start strategy — is traceable to a specific compliance requirement, latency target, or cost constraint. Nothing is present by convention.
