diff --git a/agents.md b/agents.md new file mode 100644 index 0000000..272eee1 --- /dev/null +++ b/agents.md @@ -0,0 +1,340 @@ +# Autoglue Repository Architecture & Agents + +## Overview + +Autoglue is a Kubernetes cluster management platform built with Go that manages the lifecycle of K3s clusters across GlueOps-supported cloud providers. It provides a REST API for cluster provisioning, configuration, and management, along with a web UI and Terraform provider. + +## Repository Structure + +``` +autoglue/ +├── cmd/ # CLI commands +├── internal/ # Internal packages +│ ├── api/ # HTTP API routes and middleware +│ ├── app/ # Application setup +│ ├── auth/ # Authentication logic +│ ├── bg/ # Background job workers +│ ├── common/ # Common utilities +│ ├── config/ # Configuration management +│ ├── db/ # Database operations +│ ├── handlers/ # HTTP request handlers +│ ├── keys/ # Cryptographic key management +│ ├── mapper/ # Data mapping utilities +│ ├── models/ # Database models +│ ├── utils/ # Utility functions +│ ├── version/ # Version information +│ └── web/ # Web UI integration +├── sdk/ # Generated SDKs +│ └── ts/ # TypeScript SDK +├── ui/ # Frontend application (React) +├── docs/ # OpenAPI/Swagger documentation +├── postgres/ # PostgreSQL configuration +├── main.go # Application entry point +├── schema.sql # Database schema +└── docker-compose.yml # Development environment + +``` + +## Core Components + +### 1. API Layer (`internal/api`) + +The API layer provides RESTful endpoints for managing cloud resources: + +- **Authentication** (`mount_auth_routes.go`): OAuth/OIDC integration, JWT tokens +- **Clusters** (`mount_cluster_routes.go`): Kubernetes cluster management +- **Servers** (`mount_server_routes.go`): Server resource management +- **SSH Keys** (`mount_ssh_routes.go`): SSH key generation and management +- **DNS** (`mount_dns_routes.go`): DNS record management +- **Load Balancers** (`mount_load_balancer_routes.go`): Load balancer configuration +- **Node Pools** (`mount_node_pool_routes.go`): Worker node pool management +- **Credentials** (`mount_credential_routes.go`): Cloud provider credentials +- **Organizations** (`mount_org_routes.go`): Multi-tenant organization management + +**Middleware:** +- Request logging with zerolog +- Rate limiting (1000 requests/minute per IP) +- CORS handling +- Security headers +- Authentication/Authorization +- Request body size limits (10MB) + +### 2. Background Jobs (`internal/bg`) + +Autoglue uses the [Archer](https://github.com/dyaksa/archer) job queue system with PostgreSQL-backed job persistence. + +**Active Job Workers:** + +| Worker | Purpose | Timeout | +|--------|---------|---------| +| `bootstrap_bastion` | Provision and configure bastion host servers | Configurable (default 60s) | +| `archer_cleanup` | Clean up old job records | 5 minutes | +| `tokens_cleanup` | Purge expired refresh tokens | 5 minutes | +| `db_backup_s3` | Backup database to S3 | 15 minutes | +| `dns_reconcile` | Synchronize DNS records with Route53 | 2 minutes | +| `org_key_sweeper` | Remove expired organization API keys | 5 minutes | +| `cluster_action` | Execute cluster lifecycle actions | Configurable | + +**Planned Job Workers (Currently Disabled):** + +The following workers exist in the codebase but are currently commented out: +- `prepare_cluster` - Prepare infrastructure for cluster deployment +- `cluster_setup` - Initial cluster configuration +- `cluster_bootstrap` - Full Kubernetes cluster bootstrapping process + +**Configuration:** +- `archer.instances`: Number of worker instances (default: 1) +- `archer.timeoutSec`: Job timeout in seconds (default: 60) +- `archer.cleanup_retain_days`: Job retention period (default: 7 days) + +### 3. Data Models (`internal/models`) + +**Core Models:** + +- `User` - User accounts with OAuth integration +- `Organization` - Multi-tenant organizations +- `Membership` - User-organization relationships +- `ApiKey` - API authentication tokens (user and org-level) +- `OrganizationKey` - Organization-level credentials with auto-expiry +- `Cluster` - Kubernetes cluster definitions +- `NodePool` - Worker node group configurations +- `Server` - Individual server instances +- `SshKey` - SSH keypair management with encryption +- `LoadBalancer` - Load balancer configurations +- `Domain` - DNS domain management +- `Credential` - Cloud provider API credentials (AWS, etc.) +- `Job` - Background job queue records +- `SigningKey` - JWT signing keys with rotation +- `RefreshToken` - OAuth refresh token storage +- `MasterKey` - Master encryption key for data at rest +- `Label`, `Annotation`, `Taint` - Kubernetes resource metadata + +### 4. Handlers (`internal/handlers`) + +Request handlers implement business logic for API endpoints: + +- `auth.go` - OAuth flows, token issuance +- `clusters.go` - Cluster CRUD operations +- `servers.go` - Server provisioning +- `ssh_keys.go` - SSH key generation with Ed25519/RSA support +- `dns.go` - DNS record management via Route53 +- `load_balancers.go` - Load balancer configuration +- `node_pools.go` - Node pool management with labels/annotations/taints +- `credentials.go` - Cloud credential storage +- `orgs.go` - Organization management +- `me.go` - Current user information +- `me_keys.go` - User API key management +- `health.go` - Health check endpoints +- `version.go` - Version information + +### 5. Security & Encryption + +**Cryptography:** +- **Master Key**: AES-256-GCM encryption for root secrets +- **Organization Keys**: Per-org encryption keys derived from master key +- **SSH Keys**: Secure generation and encrypted storage +- **JWT Tokens**: RS256 signing with key rotation +- **API Keys**: Argon2id hashing for token storage +- **At-Rest Encryption**: All sensitive data (kubeconfigs, credentials, SSH keys) + +**Authentication Methods:** +1. OAuth/OIDC (Google Workspace integration) +2. Bearer tokens (JWT) +3. Organization Key/Secret pairs +4. User API keys + +### 6. CLI Commands (`cmd`) + +- `serve` - Start the API server (default command) +- `keys generate` - Generate JWT signing keys +- `encrypt create-master` - Create master encryption key +- `db` - Database management utilities +- `version` - Display version information + +### 7. Integration Points + +**Cloud Providers:** +- AWS (Route53 for DNS, S3 for backups) +- Support for multi-cloud credentials + +**External Services:** +- PostgreSQL (primary data store) +- S3-compatible storage (backups) +- OAuth providers (Google) + +**SDKs:** +- TypeScript SDK (`sdk/ts/`) - Generated from OpenAPI spec +- Go SDK (consumed via module alias) - Used by external integrations + +**External Integrations:** +- Terraform Provider - Separate repository providing IaC support for Autoglue resources + +## Development Workflow + +### Prerequisites +- Go 1.25.4+ +- Docker & Docker Compose +- PostgreSQL (via docker-compose) +- Node.js (for UI development) + +### Setup +```bash +# 1. Configure environment +cp .env.example .env + +# 2. Start database +docker compose up -d + +# 3. Generate JWT keys +go run . keys generate + +# 4. Create master encryption key +go run . encrypt create-master + +# 5. Update OpenAPI docs and SDKs +make swagger +make sdk-all + +# 6. Start API server with embedded UI +go run . +``` + +### Build & Test +```bash +# Build application +go build -o autoglue . + +# Run tests +go test ./... + +# Build UI +make ui +``` + +**Note:** The Terraform provider is maintained in a separate repository. + +## API Architecture + +### Request Flow +``` +Client → CORS → Rate Limit → Logger → Auth → Handler → DB/Jobs → Response +``` + +### Authentication Flow +1. User logs in via OAuth (Google) +2. Backend validates token with provider +3. JWT access token issued (short-lived) +4. Refresh token stored in DB +5. Organization context from `X-Org-ID` header + +### Job Execution Flow +1. Handler enqueues job via `Jobs.Enqueue()` +2. Archer worker picks up job from PostgreSQL +3. Worker executes task with timeout +4. Result stored in `jobs` table +5. Retries on failure (configurable) + +## Database Schema + +**Key Tables:** +- `users` - User accounts +- `accounts` - OAuth provider linkage +- `organizations` - Tenant isolation +- `memberships` - User-org relationships +- `api_keys` - Authentication tokens +- `clusters` - K8s cluster definitions +- `node_pools` - Worker node groups +- `servers` - Compute instances +- `ssh_keys` - SSH keypair storage +- `load_balancers` - LB configurations +- `domains` - DNS domains +- `credentials` - Cloud API credentials +- `jobs` - Background job queue +- `signing_keys` - JWT key rotation +- `refresh_tokens` - OAuth token storage +- `master_keys` - Encryption key hierarchy + +## Configuration + +Environment variables (`.env`): +- `DATABASE_URL` - PostgreSQL connection string +- `JWT_PRIVATE_ENC_KEY` - JWT private key encryption +- `GOOGLE_CLIENT_ID` / `GOOGLE_CLIENT_SECRET` - OAuth +- `ALLOWED_ORIGINS` - CORS configuration +- `archer.*` - Job queue settings +- AWS credentials for Route53/S3 + +## Deployment + +### Docker +```bash +docker build -t autoglue . +docker run -p 8080:8080 --env-file .env autoglue +``` + +### Production Considerations +- Database connection pooling +- Rate limiting configuration +- CORS allowed origins +- JWT key rotation schedule +- Backup retention policies +- Worker instance scaling +- Monitoring and alerting + +## API Documentation + +- **Swagger UI**: `http://localhost:8080/swagger/index.html` +- **OpenAPI Spec**: `docs/openapi.yaml` +- **SDK Documentation**: `sdk/ts/README.md` + +## Testing + +The repository includes: +- Unit tests for handlers (`*_test.go`) +- Test utilities (`internal/testutil/`) +- Integration tests with embedded PostgreSQL + +Run tests: +```bash +go test ./internal/handlers/ +go test -v ./... +``` + +## Key Features + +1. **Multi-tenancy**: Organization-based resource isolation +2. **Encryption at Rest**: All sensitive data encrypted per-org +3. **Async Job Processing**: Background tasks with retry logic +4. **API Key Management**: Multiple authentication methods +5. **SSH Key Generation**: Automated keypair creation (RSA/Ed25519) +6. **DNS Automation**: Route53 integration for DNS records +7. **Kubernetes Management**: Cluster lifecycle automation +8. **Terraform Provider**: Infrastructure-as-Code support +9. **Web UI**: React-based management interface +10. **OpenAPI/Swagger**: Auto-generated API documentation + +## Architecture Patterns + +- **Repository Pattern**: Data access abstraction via GORM +- **Dependency Injection**: Dependencies passed to handlers +- **Middleware Chain**: Request processing pipeline +- **Job Queue**: Async processing with Archer +- **Multi-tenant**: Organization-scoped data isolation +- **Encryption**: Key hierarchy (master → org → resource) + +## Future Enhancements + +Based on commented code and structure: +- Full cluster provisioning automation +- Additional cloud provider support +- Enhanced monitoring and observability +- Cluster backup and restore +- Advanced RBAC controls +- Custom resource definitions + +## Resources + +- **GitHub**: https://github.com/GlueOps/autoglue +- **Production API**: https://autoglue.glueopshosted.com/api/v1 +- **Pre-prod API**: https://autoglue.glueopshosted.rocks/api/v1 +- **Staging API**: https://autoglue.apps.nonprod.earth.onglueops.rocks/api/v1