1. The Challenge
Built a multi-tenant Azure Kubernetes platform for Capgemini
to enable 2,500+ distributed developers to ship faster without
operational friction — resulting in a published architecture reference
model with Microsoft.
2. Situation: Business Context
Industry & Stakeholders
Global consulting organization (Capgemini) scaling a global
engineering function across EMEA, APAC, and Americas. Stakeholders:
CTO, VP Engineering, Platform Engineering leadership, 2,500+
developers, compliance/security teams.
The Problem
Thousands of engineers working in distributed teams needed a unified
development platform. The legacy approach had become a bottleneck:
- No standardized developer experience across regions
- Infrastructure provisioning took weeks (manual approval gates)
- Highly heterogeneous tech stack, increasing operational burden
- Security and governance controls inconsistently applied across teams
- No self-service model: every infrastructure change required ops team intervention
Business Impact
- Time-to-production: 3–4 weeks from request to running environment
- Engineering velocity: teams blocked waiting on infrastructure decisions
- Cost visibility: no clear cost attribution by team or project
- Compliance risk: governance inconsistency across workloads
3. Task: Requirements & Constraints
Business Objectives
- Enable global teams to provision their own infrastructure in minutes, not weeks
- Establish governance guardrails that scale with the platform
- Publish the architecture as an industry reference model (brand/thought leadership)
Functional Requirements
- Multi-tenant Kubernetes cluster with tenant isolation
- Self-service infrastructure provisioning (developer-facing)
- Support both stateless (microservices) and stateful workloads
- Integration with enterprise identity (Azure AD)
- Audit logging for compliance (SOC 2, ISO 27001)
Non-Functional Requirements (NFRs)
- Availability: 99.95% uptime SLA; multi-region failover for critical workloads
- Scalability: support 2,500+ developers, hundreds of microservices, thousands of pods
- Performance: pod startup time <2 minutes; zero-downtime rolling deployments
- Security: network policies, RBAC, container scanning, encrypted secrets management
- Compliance: GDPR, ISO 27001, SOC 2; audit trail for all provisioning actions
- Observability: metrics, logs, and traces; developer-facing dashboards
- Cost control: per-tenant cost allocation; automatic resource scaling
Constraints
- Azure commitment: strategic partnership with Microsoft (no AWS/GCP)
- Regulatory: GDPR data residency requirements
- Organizational: 2–3 platform engineers initially; had to scale to a team of 8+
- Timeline: MVP in 6 months; full rollout within 18 months
Success Criteria
- Provisioning time reduced from 3–4 weeks to <1 day (self-service)
- 100% of active projects running on the platform by Year 2
- Zero unplanned outages attributed to the platform (excluding provider incidents)
- Architecture published as a reference model with Microsoft
- Developer satisfaction score >4/5 on platform tooling
4. Architecture Overview
High-Level Design
Multi-tenant Kubernetes on Azure, deployed across regions with:
- Shared cluster infrastructure (AKS, Azure Kubernetes Service)
- Logical tenant isolation via Kubernetes namespaces + RBAC
- IaC-driven provisioning (Terraform + Helm)
- DevOps self-service through internal developer platform
- Observability integrated into every layer
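As a concrete sketch of the isolation model above, a tenant can be provisioned as a namespace plus a namespace-scoped RBAC binding to an Azure AD group. All names here (the `team-alpha` namespace, the group object ID) are illustrative, not taken from the actual platform:

```yaml
# Hypothetical tenant namespace plus a namespace-scoped RBAC binding.
apiVersion: v1
kind: Namespace
metadata:
  name: team-alpha
  labels:
    tenant: team-alpha
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-alpha-developers
  namespace: team-alpha          # binding is scoped to this namespace only
subjects:
  - kind: Group
    name: "aad-group-object-id"  # Azure AD group, via AKS AAD integration
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                     # built-in role: read/write within the namespace
  apiGroup: rbac.authorization.k8s.io
```

Because the binding is a `RoleBinding` (not a `ClusterRoleBinding`), the built-in `edit` ClusterRole grants the group write access only inside its own namespace.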
Core Components
- AKS clusters: multi-region (EMEA primary, APAC failover)
- Azure Container Registry: centralized image repository
- Terraform: infrastructure as code for clusters, networking, storage
- Helm: application templating and deployment
- Azure DevOps: CI/CD pipelines, artifact management
- Azure Monitor / Application Insights: observability
- Azure Key Vault: secrets management
- Azure Policy: cloud governance and compliance enforcement
Key Technologies / Stack
- Compute: Azure Kubernetes Service (AKS); node pools by workload type
- Storage: Azure Managed Disks and Azure Files for stateful workloads; Cosmos DB for NoSQL
- Networking: Azure VNet, network policies, Azure Load Balancer, Application Gateway
- Infrastructure as code: Terraform for cluster bootstrap; Helm for the application layer
- CI/CD: Azure DevOps Pipelines; automated testing, security scanning, deployment
- Security: pod security policies, RBAC, network policies, Falco runtime monitoring
- Observability: Prometheus + Grafana for metrics; ELK stack for logs; Jaeger for traces
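To illustrate "node pools by workload type": a stateful workload can be pinned to a dedicated AKS node pool via the pool's `agentpool` label and a matching taint toleration. The workload, pool name, and sizes below are hypothetical:

```yaml
# Illustrative: pinning a stateful workload to a dedicated node pool.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: orders-db
  namespace: team-alpha
spec:
  serviceName: orders-db
  replicas: 3
  selector:
    matchLabels:
      app: orders-db
  template:
    metadata:
      labels:
        app: orders-db
    spec:
      nodeSelector:
        agentpool: stateful        # AKS labels nodes with their pool name
      tolerations:
        - key: workload-type       # assumes the pool was created with this taint
          value: stateful
          effect: NoSchedule
      containers:
        - name: db
          image: postgres:14
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: managed-premium   # AKS storage class backed by Managed Disks
        resources:
          requests:
            storage: 100Gi
```

The taint keeps general microservice traffic off the stateful pool, while the `nodeSelector` ensures database pods never land on general-purpose nodes.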
5. Architecture Reasoning: Principal-Level Decision Making
Problem Framing
Primary business driver: speed of developer self-service without sacrificing governance at global scale
Dominant Quality Attributes:
- Developer Experience (fast provisioning, clear APIs)
- Multi-tenancy with strong isolation (security/compliance)
- Operational Automation (reduce manual toil)
Architectural Hypothesis
If we implement standards-based multi-tenant Kubernetes on Azure with IaC-driven self-service, we will achieve one-day provisioning cycles for developers, because Kubernetes namespaces + RBAC provide logical isolation while removing manual gates, while accepting operational complexity in cluster management and initial team scaling.
Option Space Considered
Option A: Kubernetes Multi-Tenancy (Chosen)
Description: shared AKS clusters with namespace-based logical isolation; self-service provisioning via IaC
Strengths:
- Cost-efficient (shared infrastructure)
- Standardized developer experience
- Rapid provisioning (<1 day via automation)
- Industry-standard (cloud-agnostic knowledge)
Weaknesses:
- Operational complexity (cluster management, upgrades)
- Requires expertise in Kubernetes & cloud security
- Noisy-neighbor problem (one tenant's workload can impact others if not properly resource-bounded)
Option B: Fully Isolated Clusters Per Tenant
Description: separate AKS cluster per tenant; full isolation at the infrastructure level
Strengths:
- Maximum isolation (security/blast radius)
- Simpler governance (policies scoped per isolated cluster)
- No noisy neighbor problems
Weaknesses:
- Significant cost overhead (2,500 devs → ~50+ clusters minimum)
- Operational burden (patch/upgrade each cluster)
- Longer provisioning (each cluster stack takes 30+ minutes)
Option C: Serverless / Functions-as-a-Service (Azure Functions)
Description: fully managed serverless compute; no cluster management
Strengths:
- Zero operational overhead
- Auto-scaling by demand
- Pay-per-execution cost model
Weaknesses:
- Not suitable for stateful workloads (the majority of Capgemini's portfolio)
- Cold-start latency unacceptable for real-time services
- Vendor lock-in (Azure Functions only)
Decision Drivers (Ranked)
1. Developer experience: self-service provisioning in <1 day required
2. Cost efficiency: global scale requires shared infrastructure
3. Operational feasibility: team size constraints (can't manage 50+ clusters)
4. Security/compliance: multi-tenancy with governance controls
5. Vendor strategy: Azure commitment + Microsoft partnership opportunity
6. Industry alignment: Kubernetes is the industry standard, so knowledge is portable
Why This Stack?
- Kubernetes: proven multi-tenancy model, industry standard, enables self-service
- Azure: strategic partnership, GDPR-compliant, mature enterprise features (Policy, DevOps)
- Terraform + IaC: reproducible infrastructure, version control, self-service API
- Azure DevOps CI/CD: internal consistency (Azure-first), integrated governance
Trade-Offs Made (Explicitly)
Trade-Off 1: Shared Clusters vs. Tenant Isolation
Optimization: cost reduction and operational simplicity through shared infrastructure
Compromise: reliance on Kubernetes RBAC + network policies for isolation (rather than hard infrastructure boundaries)
Risk introduced: misconfigured RBAC or policies could allow cross-tenant access; noisy neighbors (one team's heavy workload affecting others)
Mitigation: strict policy-as-code (Azure Policy + OPA/Rego); resource quotas per namespace; continuous audit logging
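The quota side of this mitigation can be sketched as a per-namespace `ResourceQuota` plus a `LimitRange` that supplies defaults, so pods that omit requests/limits still count against the quota. The numbers below are illustrative, not the platform's actual settings:

```yaml
# Illustrative per-tenant quota; real values would be sized per team.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "200"
---
# Defaults for containers that omit their own requests/limits.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-alpha-defaults
  namespace: team-alpha
spec:
  limits:
    - type: Container
      default:           # applied as the container's limit if unset
        cpu: 500m
        memory: 512Mi
      defaultRequest:    # applied as the container's request if unset
        cpu: 100m
        memory: 128Mi
```

Without the `LimitRange`, a quota on `requests.cpu` would reject any pod that failed to declare a request; with it, such pods get sane defaults instead.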
Trade-Off 2: Operational Complexity vs. Rapid Provisioning
Optimization: automated self-service provisioning in <1 day
Compromise: Kubernetes cluster operations require specialized expertise (upgrades, security patches, troubleshooting)
Risk introduced: the initial 2–3 platform engineers were insufficient; cluster outages can impact the entire organization
Mitigation: invest in SRE hiring; establish an on-call rotation; detailed runbooks; post-mortems on incidents
Trade-Off 3: Azure-Only Lock-In vs. Enterprise Simplicity
Optimization: simplified toolchain (all Azure), strategic partnership value
Compromise: future multi-cloud expansion requires significant refactoring
Risk introduced: vendor lock-in to Azure; potential cost increases; limited negotiating power
Mitigation: Kubernetes abstraction at the app layer reduces lock-in; re-evaluate multi-cloud after a 3-year horizon
Validation of the Hypothesis
- POC (months 1–2): built a single AKS cluster with 3 teams; validated provisioning time <1 hour
- Load testing (months 3–4): simulated 500 pods; confirmed 99.98% availability under sustained load
- Cost modeling: calculated $X per cluster/month; the shared model costs $Y/team (80% savings vs. isolated clusters)
- Security review: third-party audit confirmed RBAC and network policy configuration; zero findings
- Monitoring metrics: established an SLO/SLA dashboard tracking availability, provisioning time, and cost per workload
What Would Change Today
- Improvement 1: adopt GitOps (Flux) earlier; it would have reduced manual deployment processes
- Improvement 2: implement a service mesh (Istio) from day one; it would have simplified multi-tenancy security policies
- Improvement 3: adopt a platform-as-a-product team structure from the start; it would have focused us on DX earlier
What Would Keep
- Kubernetes + Azure decision (still correct trade-off)
- IaC-first approach (Terraform layer was solid)
- Namespace-based isolation model (scales well)
6. Implementation Highlights
Migration Strategy
Phased rollout: Start with greenfield projects → migrate legacy apps →
full production cutover
- Wave 1 (Months 1–3): 3 pilot teams, greenfield workloads
- Wave 2 (Months 4–9): 20+ teams, mixed greenfield + legacy
- Wave 3 (Months 10–18): full production migration, legacy system decommissioning
Deployment Model
Multi-region active-active; automated failover; canary deployments for
zero-downtime updates
Infrastructure as Code
Terraform modules for cluster bootstrap, networking, storage; Helm
charts for application deployments
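A per-tenant Helm values overlay for a shared application chart might look like the following. The chart keys, registry path, and hostnames are assumptions for illustration, not the platform's actual charts:

```yaml
# values-team-alpha.yaml -- hypothetical overlay applied on top of a shared chart.
replicaCount: 3
image:
  repository: myacr.azurecr.io/team-alpha/orders-api   # illustrative ACR path
  tag: "1.4.2"
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
ingress:
  enabled: true
  hosts:
    - host: orders.team-alpha.internal.example.com
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
```

Keeping the shared chart generic and pushing all tenant-specific choices into an overlay like this is what makes the provisioning path templatable end to end.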
CI/CD
Azure DevOps Pipelines; automated testing, SAST/DAST, container
scanning, policy validation before deployment
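A minimal Azure DevOps pipeline sketch for this flow (build and push to ACR, then a Helm upgrade into AKS) could look like the following; the service connection, repository, and chart names are illustrative:

```yaml
# azure-pipelines.yml -- abbreviated sketch; scan/policy steps elided to comments.
trigger:
  branches:
    include: [main]

stages:
  - stage: Build
    jobs:
      - job: BuildAndPush
        pool:
          vmImage: ubuntu-latest
        steps:
          - task: Docker@2
            inputs:
              command: buildAndPush
              repository: team-alpha/orders-api
              containerRegistry: acr-service-connection   # hypothetical connection name
              tags: $(Build.SourceVersion)
          # container scanning and policy validation tasks would run here

  - stage: Deploy
    dependsOn: Build
    jobs:
      - deployment: DeployToAKS
        environment: production
        strategy:
          runOnce:
            deploy:
              steps:
                - task: HelmDeploy@0
                  inputs:
                    command: upgrade
                    chartType: FilePath
                    chartPath: charts/orders-api
                    releaseName: orders-api
```

Gating the `Deploy` stage on an Azure DevOps environment is what gives the "policy validation before deployment" step a natural enforcement point.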
Observability
Prometheus + Grafana dashboards per tenant; ELK stack for centralized
logging; synthetic monitoring for critical paths
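As one example of the SLO tooling, a Prometheus alerting rule tied to the 99.95% availability target might look like this; the metric name and threshold are illustrative assumptions:

```yaml
# Sketch of an alerting rule backing the 99.95% availability SLO.
groups:
  - name: platform-slo
    rules:
      - alert: TenantErrorBudgetBurn
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (namespace)
            /
          sum(rate(http_requests_total[5m])) by (namespace)
            > 0.0005
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate in {{ $labels.namespace }} is burning the 99.95% SLO budget"
```

Grouping the ratio `by (namespace)` keeps the alert per-tenant, which matches the per-tenant dashboard model described above.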
Security Implementation
Pod security policies; Azure Policy enforcement; Falco runtime
monitoring; quarterly security audits
Cost Governance
Chargeback model based on CPU/memory/storage consumption; automated
cost reporting to finance
Documentation & Runbooks
Comprehensive runbooks for troubleshooting; team onboarding guides;
architecture decision records (ADRs)
7. Results: Measured Impact
Provisioning Time
Before: 3–4 weeks (manual approval gates). After: <24 hours (automated self-service).
Developer Experience
4.2/5 satisfaction score on tooling survey; 2,500+ developers enabled
on platform within 18 months
Operational Efficiency
99.97% uptime (exceeding 99.95% SLA); zero unplanned outages in Year 2
Cost Foundation
Shared model reduced per-team infrastructure cost by 75% vs. isolated
approach
Market Impact
Published reference model: jointly authored whitepaper with Microsoft on enterprise Kubernetes platform design
Thought Leadership
Spoke at Kubernetes conferences; architecture became case study for
"multi-tenancy at scale"
Team Growth
Platform engineering team expanded from 2 → 8 architects; established
SRE practice
8. Lessons Learned
Technical Insights
- Kubernetes complexity grows non-linearly with scale; invest early in SRE expertise
- RBAC is powerful but error-prone; automate policy validation through policy-as-code
- Network policies are critical for multi-tenancy; budget 2–3 sprints for correct design
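The network-policy design typically starts from a default-deny baseline per tenant namespace, with same-namespace traffic explicitly re-allowed. A minimal sketch (namespace name illustrative):

```yaml
# Default-deny ingress for a tenant namespace...
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-alpha
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
    - Ingress
---
# ...then re-allow traffic only from pods in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: team-alpha
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}  # an empty podSelector here means "this namespace's pods"
  policyTypes:
    - Ingress
```

The subtlety that consumes those 2–3 sprints is usually the exceptions: shared ingress controllers, monitoring scrapers, and DNS all need explicit allow rules on top of this baseline.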
Organizational Insights
- Platform engineering needs clear separation from ops; they're different disciplines
- Developer feedback loops drive success; establish early metrics on provisioning time and satisfaction
- Executive alignment on strategy (Azure-first) enables rapid decision-making
Risk Management Insights
- Shared clusters widen the blast radius; invest in circuit breakers and resource quotas early
- Compliance audits should inform architecture from day one, not be retrofitted late
Future Evolution Path
Next phase: Service mesh (Istio) for traffic management; GitOps (Flux)
for declarative deployments; AI/ML workload optimization (GPU
scheduling improvements)
Quick Principal-Level Summary
Key Decision Statement
We optimized for developer self-service under cost constraints, accepting operational complexity in cluster management, which resulted in 2,500+ developers shipping 90% faster and a published architecture reference model with Microsoft.
Tags: Azure Kubernetes · Multi-Tenancy · Platform Engineering · Infrastructure as Code · Governance at Scale · Microsoft Partnership · Capgemini