autoglue/agents.md
Copilot 6878bd875b Add agents.md architecture documentation (#685)
* Initial plan

* Add comprehensive agents.md documentation

Co-authored-by: allanice001 <700853+allanice001@users.noreply.github.com>

* Update agents.md to address code review feedback

Co-authored-by: allanice001 <700853+allanice001@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: allanice001 <700853+allanice001@users.noreply.github.com>
2026-02-19 17:03:21 +00:00


# Autoglue Repository Architecture & Agents
## Overview
Autoglue is a Kubernetes cluster management platform built with Go that manages the lifecycle of K3s clusters across GlueOps-supported cloud providers. It provides a REST API for cluster provisioning, configuration, and management, along with a web UI and Terraform provider.
## Repository Structure
```
autoglue/
├── cmd/ # CLI commands
├── internal/ # Internal packages
│ ├── api/ # HTTP API routes and middleware
│ ├── app/ # Application setup
│ ├── auth/ # Authentication logic
│ ├── bg/ # Background job workers
│ ├── common/ # Common utilities
│ ├── config/ # Configuration management
│ ├── db/ # Database operations
│ ├── handlers/ # HTTP request handlers
│ ├── keys/ # Cryptographic key management
│ ├── mapper/ # Data mapping utilities
│ ├── models/ # Database models
│ ├── utils/ # Utility functions
│ ├── version/ # Version information
│ └── web/ # Web UI integration
├── sdk/ # Generated SDKs
│ └── ts/ # TypeScript SDK
├── ui/ # Frontend application (React)
├── docs/ # OpenAPI/Swagger documentation
├── postgres/ # PostgreSQL configuration
├── main.go # Application entry point
├── schema.sql # Database schema
└── docker-compose.yml # Development environment
```
## Core Components
### 1. API Layer (`internal/api`)
The API layer provides RESTful endpoints for managing cloud resources:
- **Authentication** (`mount_auth_routes.go`): OAuth/OIDC integration, JWT tokens
- **Clusters** (`mount_cluster_routes.go`): Kubernetes cluster management
- **Servers** (`mount_server_routes.go`): Server resource management
- **SSH Keys** (`mount_ssh_routes.go`): SSH key generation and management
- **DNS** (`mount_dns_routes.go`): DNS record management
- **Load Balancers** (`mount_load_balancer_routes.go`): Load balancer configuration
- **Node Pools** (`mount_node_pool_routes.go`): Worker node pool management
- **Credentials** (`mount_credential_routes.go`): Cloud provider credentials
- **Organizations** (`mount_org_routes.go`): Multi-tenant organization management
**Middleware:**
- Request logging with zerolog
- Rate limiting (1000 requests/minute per IP)
- CORS handling
- Security headers
- Authentication/Authorization
- Request body size limits (10MB)
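The body size limit can be illustrated with a minimal sketch built on the standard library's `http.MaxBytesReader`. The names `limitBody`, `echoSize`, and `statusFor` are hypothetical and not taken from the Autoglue codebase, and reading "10MB" as 10 MiB is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// maxBodyBytes mirrors the documented 10MB limit (assumed here to mean 10 MiB).
const maxBodyBytes = 10 << 20

// limitBody is a hypothetical middleware that caps the request body size.
func limitBody(limit int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, limit)
		next.ServeHTTP(w, r)
	})
}

// echoSize reads the whole body and reports its size, or answers 413
// when the read fails because the cap was exceeded.
func echoSize(w http.ResponseWriter, r *http.Request) {
	data, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
		return
	}
	fmt.Fprintf(w, "%d", len(data))
}

// statusFor sends an n-byte body through the middleware and returns the status code.
func statusFor(n int, limit int64) int {
	body := strings.NewReader(strings.Repeat("x", n))
	req := httptest.NewRequest(http.MethodPost, "/", body)
	rec := httptest.NewRecorder()
	limitBody(limit, http.HandlerFunc(echoSize)).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(16, maxBodyBytes)) // 200
	fmt.Println(statusFor(64, 32))           // 413
}
```

The limit is enforced lazily: the handler only sees an error once it reads past the cap, so handlers that never read the body are unaffected.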
### 2. Background Jobs (`internal/bg`)
Autoglue uses the [Archer](https://github.com/dyaksa/archer) job queue system with PostgreSQL-backed job persistence.
**Active Job Workers:**
| Worker | Purpose | Timeout |
|--------|---------|---------|
| `bootstrap_bastion` | Provision and configure bastion host servers | Configurable (default 60s) |
| `archer_cleanup` | Clean up old job records | 5 minutes |
| `tokens_cleanup` | Purge expired refresh tokens | 5 minutes |
| `db_backup_s3` | Backup database to S3 | 15 minutes |
| `dns_reconcile` | Synchronize DNS records with Route53 | 2 minutes |
| `org_key_sweeper` | Remove expired organization API keys | 5 minutes |
| `cluster_action` | Execute cluster lifecycle actions | Configurable |
**Planned Job Workers (Currently Disabled):**
The following workers exist in the codebase but are currently commented out:
- `prepare_cluster` - Prepare infrastructure for cluster deployment
- `cluster_setup` - Initial cluster configuration
- `cluster_bootstrap` - Full Kubernetes cluster bootstrapping process
**Configuration:**
- `archer.instances`: Number of worker instances (default: 1)
- `archer.timeoutSec`: Job timeout in seconds (default: 60)
- `archer.cleanup_retain_days`: Job retention period (default: 7 days)
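As an illustration of how the documented defaults could be applied, here is a minimal sketch. `ArcherConfig` and `withDefaults` are hypothetical names, and the actual configuration loader in `internal/config` may work differently.

```go
package main

import "fmt"

// ArcherConfig is a hypothetical struct mirroring the three documented keys.
type ArcherConfig struct {
	Instances         int // archer.instances
	TimeoutSec        int // archer.timeoutSec
	CleanupRetainDays int // archer.cleanup_retain_days
}

// withDefaults fills unset (zero or negative) fields with the documented defaults.
func withDefaults(c ArcherConfig) ArcherConfig {
	if c.Instances <= 0 {
		c.Instances = 1
	}
	if c.TimeoutSec <= 0 {
		c.TimeoutSec = 60
	}
	if c.CleanupRetainDays <= 0 {
		c.CleanupRetainDays = 7
	}
	return c
}

func main() {
	cfg := withDefaults(ArcherConfig{Instances: 4})
	fmt.Printf("%+v\n", cfg) // {Instances:4 TimeoutSec:60 CleanupRetainDays:7}
}
```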
### 3. Data Models (`internal/models`)
**Core Models:**
- `User` - User accounts with OAuth integration
- `Organization` - Multi-tenant organizations
- `Membership` - User-organization relationships
- `ApiKey` - API authentication tokens (user and org-level)
- `OrganizationKey` - Organization-level credentials with auto-expiry
- `Cluster` - Kubernetes cluster definitions
- `NodePool` - Worker node group configurations
- `Server` - Individual server instances
- `SshKey` - SSH keypair management with encryption
- `LoadBalancer` - Load balancer configurations
- `Domain` - DNS domain management
- `Credential` - Cloud provider API credentials (AWS, etc.)
- `Job` - Background job queue records
- `SigningKey` - JWT signing keys with rotation
- `RefreshToken` - OAuth refresh token storage
- `MasterKey` - Master encryption key for data at rest
- `Label`, `Annotation`, `Taint` - Kubernetes resource metadata
### 4. Handlers (`internal/handlers`)
Request handlers implement business logic for API endpoints:
- `auth.go` - OAuth flows, token issuance
- `clusters.go` - Cluster CRUD operations
- `servers.go` - Server provisioning
- `ssh_keys.go` - SSH key generation with Ed25519/RSA support
- `dns.go` - DNS record management via Route53
- `load_balancers.go` - Load balancer configuration
- `node_pools.go` - Node pool management with labels/annotations/taints
- `credentials.go` - Cloud credential storage
- `orgs.go` - Organization management
- `me.go` - Current user information
- `me_keys.go` - User API key management
- `health.go` - Health check endpoints
- `version.go` - Version information
### 5. Security & Encryption
**Cryptography:**
- **Master Key**: AES-256-GCM encryption for root secrets
- **Organization Keys**: Per-org encryption keys derived from master key
- **SSH Keys**: Secure generation and encrypted storage
- **JWT Tokens**: RS256 signing with key rotation
- **API Keys**: Argon2id hashing for token storage
- **At-Rest Encryption**: All sensitive data (kubeconfigs, credentials, SSH keys) is encrypted before it is stored
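To make the AES-256-GCM item concrete, the sketch below encrypts and decrypts a secret with Go's standard `crypto/aes` and `crypto/cipher` packages. The helper names (`newGCM`, `encrypt`, `decrypt`) and the nonce-prefixed layout are assumptions for illustration. They do not describe how Autoglue actually stores ciphertext or derives per-org keys.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-256-GCM AEAD; the key must be exactly 32 bytes.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("AES-256 requires a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encrypt returns nonce || ciphertext || tag (layout is an assumption).
func encrypt(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt splits off the nonce and authenticates the rest before returning plaintext.
func decrypt(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32) // stands in for a master or per-org key
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := encrypt(key, []byte("example-secret"))
	if err != nil {
		panic(err)
	}
	plain, err := decrypt(key, sealed)
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(plain, []byte("example-secret"))) // true
}
```

Because GCM is authenticated, any modification of the stored bytes makes `decrypt` return an error instead of corrupted plaintext.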
**Authentication Methods:**
1. OAuth/OIDC (Google Workspace integration)
2. Bearer tokens (JWT)
3. Organization Key/Secret pairs
4. User API keys
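For the bearer-token method, RS256 means an RSASSA-PKCS1-v1_5 signature over a SHA-256 digest of the `header.payload` string. The following is a minimal standard-library sketch of that signing and verification step. The function names are hypothetical, the verifier does not validate claims or the `alg` header, and Autoglue may well use a JWT library instead.

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// demoKey generates a throwaway 2048-bit RSA key for the example.
func demoKey() (*rsa.PrivateKey, error) {
	return rsa.GenerateKey(rand.Reader, 2048)
}

// signRS256 builds a compact JWT from a raw JSON payload.
func signRS256(priv *rsa.PrivateKey, payloadJSON string) (string, error) {
	enc := base64.RawURLEncoding
	header := enc.EncodeToString([]byte(`{"alg":"RS256","typ":"JWT"}`))
	signingInput := header + "." + enc.EncodeToString([]byte(payloadJSON))
	digest := sha256.Sum256([]byte(signingInput))
	sig, err := rsa.SignPKCS1v15(rand.Reader, priv, crypto.SHA256, digest[:])
	if err != nil {
		return "", err
	}
	return signingInput + "." + enc.EncodeToString(sig), nil
}

// verifyRS256 checks only the signature; expiry, audience, and the alg
// header are deliberately not validated in this sketch.
func verifyRS256(pub *rsa.PublicKey, token string) error {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return errors.New("malformed token")
	}
	sig, err := base64.RawURLEncoding.DecodeString(parts[2])
	if err != nil {
		return err
	}
	digest := sha256.Sum256([]byte(parts[0] + "." + parts[1]))
	return rsa.VerifyPKCS1v15(pub, crypto.SHA256, digest[:], sig)
}

func main() {
	key, err := demoKey()
	if err != nil {
		panic(err)
	}
	token, err := signRS256(key, `{"sub":"example-user"}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(verifyRS256(&key.PublicKey, token) == nil) // true
}
```

Key rotation fits this model naturally: tokens signed with an older key stay verifiable as long as its public half is still published.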
### 6. CLI Commands (`cmd`)
- `serve` - Start the API server (default command)
- `keys generate` - Generate JWT signing keys
- `encrypt create-master` - Create master encryption key
- `db` - Database management utilities
- `version` - Display version information
### 7. Integration Points
**Cloud Providers:**
- AWS (Route53 for DNS, S3 for backups)
- Support for multi-cloud credentials
**External Services:**
- PostgreSQL (primary data store)
- S3-compatible storage (backups)
- OAuth providers (Google)
**SDKs:**
- TypeScript SDK (`sdk/ts/`) - Generated from OpenAPI spec
- Go SDK (consumed via module alias) - Used by external integrations
**External Integrations:**
- Terraform Provider - Separate repository providing IaC support for Autoglue resources
## Development Workflow
### Prerequisites
- Go 1.25.4+
- Docker & Docker Compose
- PostgreSQL (via docker-compose)
- Node.js (for UI development)
### Setup
```bash
# 1. Configure environment
cp .env.example .env
# 2. Start database
docker compose up -d
# 3. Generate JWT keys
go run . keys generate
# 4. Create master encryption key
go run . encrypt create-master
# 5. Update OpenAPI docs and SDKs
make swagger
make sdk-all
# 6. Start API server with embedded UI
go run .
```
### Build & Test
```bash
# Build application
go build -o autoglue .
# Run tests
go test ./...
# Build UI
make ui
```
**Note:** The Terraform provider is maintained in a separate repository.
## API Architecture
### Request Flow
```
Client → CORS → Rate Limit → Logger → Auth → Handler → DB/Jobs → Response
```
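The ordering in this pipeline can be sketched with plain `net/http` middleware composition. The `chain` and `tag` helpers are hypothetical placeholders that only record the order in which stages run. The real middleware in `internal/api` is not shown here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

type middleware func(http.Handler) http.Handler

// chain wraps h so that the first middleware listed runs first.
func chain(h http.Handler, mws ...middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tag returns a placeholder middleware that records its name and continues.
func tag(name string, trace *[]string) middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*trace = append(*trace, name)
			next.ServeHTTP(w, r)
		})
	}
}

// runPipeline sends one request through the chain and returns the stages hit, in order.
func runPipeline() []string {
	var trace []string
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		trace = append(trace, "handler")
	})
	h := chain(handler,
		tag("cors", &trace),
		tag("rate-limit", &trace),
		tag("logger", &trace),
		tag("auth", &trace),
	)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	return trace
}

func main() {
	fmt.Println(runPipeline()) // [cors rate-limit logger auth handler]
}
```

A real middleware can short-circuit by not calling `next`, which is how a rate limiter or auth check would reject a request before it reaches the handler.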
### Authentication Flow
1. User logs in via OAuth (Google)
2. Backend validates token with provider
3. JWT access token issued (short-lived)
4. Refresh token stored in DB
5. Organization context is taken from the `X-Org-ID` request header
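On subsequent requests, the access token and organization context have to be pulled out of the headers. A minimal sketch of that step follows. `requestAuth` and `authFromHeaders` are hypothetical names, and both the `Authorization: Bearer` header form and treating `X-Org-ID` as mandatory are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// requestAuth holds the raw credentials extracted from one request.
type requestAuth struct {
	Token string
	OrgID string
}

// authFromHeaders parses the Authorization and X-Org-ID header values.
// Token validation (signature, expiry) would happen in a later step.
func authFromHeaders(authorization, orgID string) (requestAuth, error) {
	token, ok := strings.CutPrefix(authorization, "Bearer ")
	if !ok || token == "" {
		return requestAuth{}, errors.New("missing bearer token")
	}
	if orgID == "" {
		return requestAuth{}, errors.New("missing X-Org-ID header")
	}
	return requestAuth{Token: token, OrgID: orgID}, nil
}

func main() {
	h := http.Header{}
	h.Set("Authorization", "Bearer example-token")
	h.Set("X-Org-ID", "org-123")

	auth, err := authFromHeaders(h.Get("Authorization"), h.Get("X-Org-ID"))
	fmt.Println(auth.OrgID, err) // org-123 <nil>
}
```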
### Job Execution Flow
1. Handler enqueues job via `Jobs.Enqueue()`
2. Archer worker picks up job from PostgreSQL
3. Worker executes task with timeout
4. Result stored in `jobs` table
5. Retries on failure (configurable)
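Steps 3 and 5 (a per-attempt timeout plus retries) can be sketched with `context.WithTimeout`. This is a generic illustration, not Archer's API: `runWithRetry`, `flakyJob`, and `demoFlaky` are hypothetical helpers, and Archer's real retry and backoff behavior is not reproduced.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithRetry runs task up to maxAttempts times, giving each attempt its own
// timeout. It returns the number of attempts made and the last error, if any.
func runWithRetry(ctx context.Context, maxAttempts int, timeout time.Duration,
	task func(context.Context) error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, timeout)
		err = task(attemptCtx)
		cancel()
		if err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, fmt.Errorf("job failed after %d attempts: %w", maxAttempts, err)
}

// flakyJob returns a task that fails the first `failures` calls, then succeeds.
func flakyJob(failures int) func(context.Context) error {
	calls := 0
	return func(ctx context.Context) error {
		calls++
		if err := ctx.Err(); err != nil {
			return err // timeout or cancellation already hit
		}
		if calls <= failures {
			return errors.New("transient failure")
		}
		return nil
	}
}

// demoFlaky wires the pieces together with a short timeout for the example.
func demoFlaky(failures, maxAttempts int) (int, error) {
	return runWithRetry(context.Background(), maxAttempts, 50*time.Millisecond, flakyJob(failures))
}

func main() {
	attempts, err := demoFlaky(2, 3)
	fmt.Println(attempts, err) // 3 <nil>
}
```

The timeout only takes effect if the task honors its context, so long-running workers need to check `ctx.Done()` or pass the context to blocking calls.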
## Database Schema
**Key Tables:**
- `users` - User accounts
- `accounts` - OAuth provider linkage
- `organizations` - Tenant isolation
- `memberships` - User-org relationships
- `api_keys` - Authentication tokens
- `clusters` - K8s cluster definitions
- `node_pools` - Worker node groups
- `servers` - Compute instances
- `ssh_keys` - SSH keypair storage
- `load_balancers` - LB configurations
- `domains` - DNS domains
- `credentials` - Cloud API credentials
- `jobs` - Background job queue
- `signing_keys` - JWT key rotation
- `refresh_tokens` - OAuth token storage
- `master_keys` - Encryption key hierarchy
## Configuration
Environment variables (`.env`):
- `DATABASE_URL` - PostgreSQL connection string
- `JWT_PRIVATE_ENC_KEY` - Key used to encrypt the JWT private signing key
- `GOOGLE_CLIENT_ID` / `GOOGLE_CLIENT_SECRET` - OAuth
- `ALLOWED_ORIGINS` - CORS configuration
- `archer.*` - Job queue settings
- AWS credentials for Route53/S3
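A minimal sketch of loading and validating two of these variables is shown below. The `Settings` type and `loadSettings` function are hypothetical. Treating `DATABASE_URL` as required and `ALLOWED_ORIGINS` as a comma-separated list are assumptions, not documented behavior.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Settings is a hypothetical subset of the runtime configuration.
type Settings struct {
	DatabaseURL    string
	AllowedOrigins []string
}

// loadSettings reads variables through getenv (os.Getenv in a real program),
// which keeps the function easy to exercise with fake values.
func loadSettings(getenv func(string) string) (Settings, error) {
	dbURL := getenv("DATABASE_URL")
	if dbURL == "" {
		return Settings{}, errors.New("DATABASE_URL is required")
	}
	var origins []string
	// Assumption: origins are comma-separated; blanks are ignored.
	for _, o := range strings.Split(getenv("ALLOWED_ORIGINS"), ",") {
		if o = strings.TrimSpace(o); o != "" {
			origins = append(origins, o)
		}
	}
	return Settings{DatabaseURL: dbURL, AllowedOrigins: origins}, nil
}

func main() {
	fake := map[string]string{
		"DATABASE_URL":    "postgres://user:pass@localhost:5432/autoglue",
		"ALLOWED_ORIGINS": "http://localhost:5173, http://localhost:8080",
	}
	s, err := loadSettings(func(k string) string { return fake[k] })
	fmt.Println(len(s.AllowedOrigins), err) // 2 <nil>
}
```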
## Deployment
### Docker
```bash
docker build -t autoglue .
docker run -p 8080:8080 --env-file .env autoglue
```
### Production Considerations
- Database connection pooling
- Rate limiting configuration
- CORS allowed origins
- JWT key rotation schedule
- Backup retention policies
- Worker instance scaling
- Monitoring and alerting
## API Documentation
- **Swagger UI**: `http://localhost:8080/swagger/index.html`
- **OpenAPI Spec**: `docs/openapi.yaml`
- **SDK Documentation**: `sdk/ts/README.md`
## Testing
The repository includes:
- Unit tests for handlers (`*_test.go`)
- Test utilities (`internal/testutil/`)
- Integration tests with embedded PostgreSQL
Run tests:
```bash
go test ./internal/handlers/
go test -v ./...
```
## Key Features
1. **Multi-tenancy**: Organization-based resource isolation
2. **Encryption at Rest**: All sensitive data encrypted per-org
3. **Async Job Processing**: Background tasks with retry logic
4. **API Key Management**: Multiple authentication methods
5. **SSH Key Generation**: Automated keypair creation (RSA/Ed25519)
6. **DNS Automation**: Route53 integration for DNS records
7. **Kubernetes Management**: Cluster lifecycle automation
8. **Terraform Provider**: Infrastructure-as-Code support
9. **Web UI**: React-based management interface
10. **OpenAPI/Swagger**: Auto-generated API documentation
## Architecture Patterns
- **Repository Pattern**: Data access abstraction via GORM
- **Dependency Injection**: Dependencies passed to handlers
- **Middleware Chain**: Request processing pipeline
- **Job Queue**: Async processing with Archer
- **Multi-tenant**: Organization-scoped data isolation
- **Encryption**: Key hierarchy (master → org → resource)
## Future Enhancements
Inferred from commented-out code and the repository structure:
- Full cluster provisioning automation
- Additional cloud provider support
- Enhanced monitoring and observability
- Cluster backup and restore
- Advanced RBAC controls
- Custom resource definitions
## Resources
- **GitHub**: https://github.com/GlueOps/autoglue
- **Production API**: https://autoglue.glueopshosted.com/api/v1
- **Pre-prod API**: https://autoglue.glueopshosted.rocks/api/v1
- **Staging API**: https://autoglue.apps.nonprod.earth.onglueops.rocks/api/v1