
Architecture Overview

ALPCRUN.CH is designed as a cloud-native, microservices-based HPC middleware with a focus on scalability, reliability, and performance.

Design Principles

1. Streaming-First Architecture

All communication between clients, queue manager, and workers uses bidirectional gRPC streaming. This design choice provides:

  • Low latency: No request/response overhead for each task
  • High throughput: Continuous data flow without connection overhead
  • Natural backpressure: Stream flow control prevents overload
  • Efficient batching: Multiple tasks/results in single network operations
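
For illustration, the core send/receive pattern on such a stream might look like the Go sketch below. The pb import path, the ClientStream RPC, and all message types are hypothetical placeholders for the SDK's generated bindings, not the actual API:

```go
import (
	"context"
	"io"

	pb "example.com/alpcrun/sdk/pb" // hypothetical import path for the generated bindings
)

// streamTasks sends task batches and receives result batches over one
// bidirectional stream; gRPC flow control supplies the backpressure.
func streamTasks(ctx context.Context, client pb.QueueManagerClient,
	pending <-chan *pb.TaskBatch, handle func(*pb.ResultBatch)) error {

	stream, err := client.ClientStream(ctx)
	if err != nil {
		return err
	}

	// Sender goroutine: batches flow continuously, no per-task RPCs.
	go func() {
		for batch := range pending {
			if stream.Send(batch) != nil {
				return // broken stream; the receive loop sees the error too
			}
		}
		stream.CloseSend()
	}()

	// Receive loop: result batches arrive on the same stream.
	for {
		res, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		handle(res)
	}
}
```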

2. Queue-Based Task Distribution

ALPCRUN.CH uses a centralized queue manager with per-session queues:

  • Task queues: Pending work for each session
  • Result queues: Completed work awaiting client collection
  • Dead letter queues: Failed tasks for inspection and retry

This architecture enables:

  • Fair scheduling: Priority-based task distribution
  • Session isolation: Independent workload contexts
  • Worker flexibility: Workers can bind/unbind from sessions dynamically
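
As a mental model, the per-session state could be sketched like this in Go; all names are illustrative assumptions, not the actual implementation:

```go
// Illustrative model only; field names and types are assumptions.
type Session struct {
	ID         string
	Priority   int               // used for fair, priority-based scheduling
	Tasks      chan *TaskBatch   // pending work for this session
	Results    chan *ResultBatch // completed work awaiting collection
	DeadLetter chan *TaskBatch   // failed tasks kept for inspection and retry
	Workers    map[string]bool   // workers currently bound to this session
}
```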

3. Distributed Caching

The two-tier cache architecture (central + node-level) provides:

  • Shared data distribution: Common data replicated to all workers
  • Reduced network traffic: Local node caches minimize central cache requests
  • Versioned updates: Track data changes with update levels
  • TTL-based expiration: Automatic cleanup of stale data
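
A versioned, TTL-bounded cache entry might be modeled as follows (a sketch; the field names are assumptions):

```go
// Sketch of a versioned cache entry; not the actual storage schema.
type CacheEntry struct {
	SessionID   string
	UpdateLevel uint64    // monotonically increasing data version
	Data        []byte
	ExpiresAt   time.Time // periodic cleanup drops the entry after this
}

// Expired reports whether the entry's TTL has elapsed.
func (e *CacheEntry) Expired(now time.Time) bool {
	return now.After(e.ExpiresAt)
}
```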

System Architecture

A three-tier architecture in which the middleware provides the low-latency interface between clients and workers. User-written clients and worker services can use the gRPC language bindings provided with the SDK for easy integration.

(Figure: ALPCRUN.CH architecture overview)

Core Components

Queue Manager

Purpose: Central coordinator for session management and task distribution

Responsibilities:

  • Session lifecycle management (create, update, close)
  • Task and result queue management per session
  • Worker registration and instruction (BIND/UNBIND)
  • Priority-based scheduling
  • Dead letter queue handling

Interfaces:

  • gRPC service on port 1337
  • Prometheus metrics on port 8080

Key Features:

  • Multiple queue implementations (priority, lock-free, speed-test)
  • Configurable buffer sizes
  • Idle timeout handling
  • Worker limits per session

Central Cache

Purpose: Centralized storage for session-shared data

Responsibilities:

  • Store and retrieve shared data
  • Version tracking (update levels)
  • TTL-based expiration
  • Atomic updates

Interfaces:

  • gRPC service on port 2337
  • Prometheus metrics on port 8081

Key Features:

  • In-memory storage for low latency
  • Configurable maximum TTL
  • Periodic cleanup of expired data
  • Thread-safe concurrent access

Node Cache

Purpose: Local pull-through cache for reduced network overhead

Responsibilities:

  • Cache shared data locally on each compute node
  • Proxy requests to central cache on cache miss
  • Maintain consistency with central cache
  • Reduce central cache load

Interfaces:

  • gRPC service on port 3337
  • Prometheus metrics on port 8082

Deployment:

  • Typically deployed as a Kubernetes DaemonSet
  • One instance per compute node
  • Workers connect to local node cache

Data Flow

Task Submission Flow

  1. Client creates a session with the queue manager
  2. Client optionally uploads shared data to central cache
  3. Client opens a bidirectional stream to queue manager
  4. Client sends task batches to the stream
  5. Queue Manager enqueues tasks in session's task queue
  6. Queue Manager schedules workers to process the session
  7. Worker receives task batches from the stream
  8. Worker fetches shared data from node/central cache if needed
  9. Worker processes tasks and sends result batches back
  10. Queue Manager enqueues results in session's result queue
  11. Client receives result batches from the stream
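
Put together, the client side of steps 1 through 4 might look like the sketch below; all pb.* names are hypothetical placeholders for the SDK's generated bindings, and the stream handling then continues as in the earlier streaming sketch:

```go
// Hypothetical sketch of steps 1-4; pb.* names are placeholders for the
// SDK's generated gRPC bindings, not the actual API.
sess, err := qm.CreateSession(ctx, &pb.CreateSessionRequest{})
if err != nil {
	log.Fatal(err)
}

// Step 2 (optional): upload shared data to the central cache.
if _, err := cache.Put(ctx, &pb.PutRequest{
	SessionId:   sess.Id,
	UpdateLevel: 1,
	Data:        shared,
}); err != nil {
	log.Fatal(err)
}

// Steps 3-4: open the bidirectional stream and send task batches;
// result batches come back on the same stream (step 11), as in the
// streaming sketch earlier in this page.
stream, err := qm.ClientStream(ctx)
if err != nil {
	log.Fatal(err)
}
```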

Worker Binding Flow

  1. Worker connects to queue manager via WorkerUpdate stream
  2. Queue Manager assigns a unique worker ID (via response header)
  3. Worker sends idle state updates periodically
  4. Queue Manager sends BIND instruction when work is available
  5. Worker opens WorkerStream to the specified session
  6. Worker processes tasks until session closes or idle timeout
  7. Queue Manager sends UNBIND instruction
  8. Worker returns to idle state
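
Sketched from the worker's perspective: the WorkerUpdate stream and the BIND/UNBIND instructions are named in the flow above, but all pb.* shapes, field names, and the "worker-id" header key below are assumptions:

```go
// Hypothetical worker loop over the WorkerUpdate stream.
stream, err := qm.WorkerUpdate(ctx)
if err != nil {
	log.Fatal(err)
}

// Step 2: the assigned worker ID arrives in the response header.
hdr, err := stream.Header()
if err != nil {
	log.Fatal(err)
}
log.Printf("assigned worker ID: %v", hdr.Get("worker-id"))

for {
	// Step 3: periodically report idle state.
	if err := stream.Send(&pb.WorkerState{Idle: true}); err != nil {
		log.Fatal(err)
	}
	in, err := stream.Recv()
	if err != nil {
		log.Fatal(err)
	}
	switch in.Instruction {
	case pb.Instruction_BIND:
		// Steps 5-6: open a WorkerStream to the session and process
		// tasks until the session closes or the idle timeout fires.
		processSession(ctx, in.SessionId)
	case pb.Instruction_UNBIND:
		// Steps 7-8: fall through and return to the idle loop.
	}
}
```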

Shared Data Flow

  1. Client uploads data to central cache with session ID and update level
  2. Client sends task batches with the use_shared_data=true flag
  3. Worker receives task batch with shared data reference
  4. Worker requests data from node cache (local)
  5. Node Cache checks local storage:
       • Cache hit: returns data immediately
       • Cache miss: fetches from central cache and caches locally
  6. Worker uses shared data for task processing
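
Steps 4 and 5 amount to a pull-through lookup. A minimal Go sketch, reusing the hypothetical CacheEntry from earlier; all method and field names are illustrative assumptions:

```go
// Sketch of the node cache's pull-through lookup.
func (n *NodeCache) Get(ctx context.Context, key string) ([]byte, error) {
	n.mu.RLock()
	entry, ok := n.local[key]
	n.mu.RUnlock()
	if ok && !entry.Expired(time.Now()) {
		return entry.Data, nil // cache hit: served locally, no network trip
	}

	// Cache miss: fetch from the central cache, then keep a local copy.
	resp, err := n.central.Get(ctx, &pb.GetRequest{Key: key})
	if err != nil {
		return nil, err
	}
	n.mu.Lock()
	n.local[key] = &CacheEntry{Data: resp.Data, ExpiresAt: time.Now().Add(n.ttl)}
	n.mu.Unlock()
	return resp.Data, nil
}
```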

Scalability

Horizontal Scaling

All components are designed for horizontal scalability:

Queue Manager:

  • Can run multiple instances behind a load balancer
  • Session affinity not required (stateless except in-memory queues)
  • For production, consider external queue storage (future enhancement)

Central Cache:

  • Single instance typically sufficient
  • Can be replicated with consistent hashing (future enhancement)
  • Backed by distributed storage for high availability

Node Cache:

  • Automatically scales with cluster size (DaemonSet)
  • No coordination required between instances

Workers:

  • Fully stateless and horizontally scalable
  • Add/remove workers dynamically without disruption
  • Auto-scaling based on queue depth

Performance Characteristics

Throughput:

  • Queue Manager: 10,000+ tasks/sec per session
  • Central Cache: 5,000+ operations/sec
  • Node Cache: 50,000+ operations/sec (cache hits)

Latency:

  • Task submission to worker: <10ms (queue + network)
  • Cache lookup (hit): <1ms (node cache)
  • Cache lookup (miss): <5ms (central cache)

Capacity:

  • Sessions per queue manager: 1,000+
  • Workers per session: 10,000+
  • Concurrent tasks: limited by worker capacity

Reliability and Fault Tolerance

Dead Letter Queues

Failed tasks are automatically moved to dead letter queues:

  • Clients can retrieve dead letter batches via ClientStreamDeadLetters
  • Errors can be inspected and failed tasks optionally retried
  • This prevents silent data loss
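
A hypothetical retrieval loop: ClientStreamDeadLetters is named above, but the signature and message fields here are assumptions:

```go
// Hypothetical dead letter retrieval; pb.* shapes are assumptions.
stream, err := qm.ClientStreamDeadLetters(ctx, &pb.DeadLetterRequest{SessionId: sessID})
if err != nil {
	log.Fatal(err)
}
for {
	batch, err := stream.Recv()
	if err == io.EOF {
		break
	}
	if err != nil {
		log.Fatal(err)
	}
	for _, t := range batch.Tasks {
		log.Printf("task %s failed: %s", t.Id, t.Error)
		// Optionally fix the input and resubmit on the regular client stream.
	}
}
```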

Session Cleanup

Automatic cleanup on client disconnect (optional):

  • Set cleanup_on_client_exit=true in session attributes
  • Queue manager removes session and associated data
  • Prevents resource leaks
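
A hypothetical way this could look at session creation; the attribute mirrors cleanup_on_client_exit from above, but the message shape is an assumption:

```go
// Hypothetical: opt in to automatic cleanup when creating the session.
sess, err := qm.CreateSession(ctx, &pb.CreateSessionRequest{
	Attributes: &pb.SessionAttributes{CleanupOnClientExit: true},
})
```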

Worker Failure Handling

If a worker crashes or disconnects:

  • Queue manager detects the disconnect via a stream error
  • In-flight tasks are automatically re-queued
  • A new worker can pick up the tasks immediately

Network Resilience

The gRPC transport automatically handles:

  • Connection keepalives
  • Reconnection with backoff
  • Flow control and backpressure
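
On the client side, grpc-go exposes these knobs at dial time; the values below are illustrative, not ALPCRUN.CH defaults. The channel also reconnects with exponential backoff on its own:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Plaintext credentials for brevity; see Transport Security below.
	conn, err := grpc.NewClient("queue-manager:1337",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after this much idle time
			Timeout:             10 * time.Second, // wait this long for the ping ack
			PermitWithoutStream: true,             // keep pinging even with no active streams
		}),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```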

Security

Authentication

API key-based authentication:

  • Required for all client and worker connections
  • Passed via gRPC metadata header
  • Bcrypt-hashed keys for production
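
In Go, the key is typically attached with the grpc metadata package. A sketch; the header name "x-api-key" is an assumption, so use the header documented in the Authentication Reference:

```go
import (
	"context"

	"google.golang.org/grpc/metadata"
)

// withAPIKey attaches the API key to outgoing RPCs via gRPC metadata.
func withAPIKey(ctx context.Context, apiKey string) context.Context {
	return metadata.AppendToOutgoingContext(ctx, "x-api-key", apiKey)
}
```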

See Authentication Reference for details.

Transport Security

TLS/mTLS support:

  • Configurable per service
  • Certificate-based mutual authentication
  • Recommended for production deployments
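
With grpc-go, server-authenticated TLS can be configured as in this sketch; the file name, address, and server name are examples. For mTLS, build a tls.Config carrying a client certificate and wrap it with credentials.NewTLS instead:

```go
import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// dialTLS connects to the queue manager over server-authenticated TLS.
func dialTLS() (*grpc.ClientConn, error) {
	creds, err := credentials.NewClientTLSFromFile("ca.pem", "queue-manager.example.com")
	if err != nil {
		return nil, err
	}
	return grpc.NewClient("queue-manager:1337", grpc.WithTransportCredentials(creds))
}
```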

Network Policies

Kubernetes network policies recommended:

  • Restrict queue manager access to authorized pods
  • Isolate cache services from external traffic
  • Worker-to-cache communication only

Monitoring and Observability

Metrics

All services expose Prometheus metrics:

  • Connection counts (clients, workers)
  • Queue depths (tasks, results, dead letters)
  • Task/result throughput rates
  • Cache hit/miss rates
  • gRPC call latencies and error rates
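
In Go, such an endpoint is typically exposed with the Prometheus client library. A generic sketch, not the services' actual code; the port matches the queue manager's documented metrics port:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default Prometheus registry on /metrics.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```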

Logging

Structured logging with configurable levels:

  • JSON format for machine parsing
  • Request ID propagation for tracing
  • Source file and line numbers (optional)
  • Pretty printing for development
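
With Go's standard log/slog package, a logger with these properties looks roughly like this; a generic sketch, not the services' actual logger configuration:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON output for machine parsing, with source file and line numbers.
	logger := slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{
		Level:     slog.LevelInfo,
		AddSource: true,
	}))
	logger.Info("worker bound", "session_id", "abc123", "worker_id", 42)
}
```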

Profiling

Built-in pprof support:

  • CPU and memory profiling
  • Goroutine analysis
  • Mutex contention profiling
  • Enabled via --pprof flag
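
In a Go service, pprof is conventionally enabled by registering the net/http/pprof handlers; a generic sketch of that mechanism (ALPCRUN.CH gates it behind its --pprof flag):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// Serve profiles on localhost only; not for public exposure.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```

Profiles can then be pulled with, for example, go tool pprof http://localhost:6060/debug/pprof/profile.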

Next Steps