Running AEM on Kubernetes (EKS): the patterns that actually work in production

Everyone is containerising everything. AEM is not “everything.” It’s a stateful application whose JCR repository is persisted by Oak, with a large memory footprint and cold-start times measured in minutes, not seconds. Running it on Kubernetes requires rethinking several assumptions.

Why we moved AEM to EKS

The driver wasn’t fashion — it was operational cost and deployment consistency. We had AEM running on dedicated EC2 instances, with different AMIs across environments, manual patching, and snowflake configurations that made “works in staging” meaningless. EKS solved the consistency problem. The persistence problem required more careful design.

The JCR persistence challenge

AEM’s JCR (Java Content Repository) is file-system backed. Pods are ephemeral. This is the fundamental tension. You cannot run AEM author as a stateless pod — you need persistent storage that survives pod restarts, rescheduling, and node failures.

# EKS storage config for AEM author
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: aem-author-repository
spec:
  accessModes:
    - ReadWriteOnce       # author is single-writer
  storageClassName: gp3   # AWS EBS gp3 — better IOPS/cost than gp2
  resources:
    requests:
      storage: 100Gi      # size for JCR + Oak compaction headroom
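
The claim above refers to a storage class called gp3, which has to exist in the cluster before the PVC can bind. A minimal sketch of such a class, assuming the AWS EBS CSI driver is installed; the encryption, reclaim, and expansion settings are illustrative assumptions, not taken from our setup:

```yaml
# Hypothetical StorageClass named "gp3", matching the PVC's storageClassName.
# Assumes the AWS EBS CSI driver add-on is installed in the cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"                        # assumption: encrypt volumes at rest
volumeBindingMode: WaitForFirstConsumer    # create the volume in the AZ where the pod lands
reclaimPolicy: Retain                      # keep the EBS volume if the PVC is deleted
allowVolumeExpansion: true                 # allow growing the volume later
```

Retain is a cautious choice for a repository volume: deleting the claim by accident does not delete the data with it.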

AEM publish instances are read-heavy, so they can mount shared content ReadOnlyMany from EFS (AWS’s managed NFS file system) while each pod keeps its own Oak store on a ReadWriteOnce EBS volume.
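
One way to give every publish pod its own ReadWriteOnce volume is a StatefulSet with volumeClaimTemplates. This is a sketch under stated assumptions rather than our exact manifest: the image name, mount path, replica count, and headless Service name are placeholders, and 4503 is simply the conventional publish port.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: aem-publish
spec:
  serviceName: aem-publish          # headless Service, assumed to exist
  replicas: 2                       # illustrative
  selector:
    matchLabels:
      app: aem-publish
  template:
    metadata:
      labels:
        app: aem-publish
    spec:
      containers:
        - name: aem-publish
          image: registry.example.com/aem-publish:latest    # placeholder image
          ports:
            - containerPort: 4503                           # conventional publish port
          volumeMounts:
            - name: repository
              mountPath: /opt/aem/crx-quickstart/repository # assumed install path
  volumeClaimTemplates:             # one PVC (and one EBS volume) per pod
    - metadata:
        name: repository
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: gp3
        resources:
          requests:
            storage: 100Gi
```

Each replica gets a claim named repository-aem-publish-0, repository-aem-publish-1, and so on, and a rescheduled pod reattaches to the same claim.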

Health checks that actually work

AEM’s startup time is 3–8 minutes depending on instance size and repository compaction state. A liveness probe that starts counting failures before AEM has finished booting will restart the container in a loop. Use a startup probe with a generous failureThreshold to give AEM time to boot; Kubernetes does not run the liveness probe until the startup probe has succeeded:

startupProbe:
  httpGet:
    path: /libs/granite/core/content/login.html
    port: 4502
  failureThreshold: 40      # 40 × 15s = 10 minutes startup window
  periodSeconds: 15

livenessProbe:
  httpGet:
    path: /system/health   # Sling health check servlet
    port: 4502
  initialDelaySeconds: 0
  periodSeconds: 30
  failureThreshold: 3

Memory configuration — the number one mistake

AEM is a JVM application. The JVM manages its own memory independently of the container’s memory limit. If your container limit is 8Gi but your JVM heap is set to 4Gi, you have 4Gi left for the OS, metaspace, native memory, and Oak’s off-heap segment cache. Size it wrong and the kernel’s OOM killer terminates the container under load.

Rule of thumb: container memory limit = JVM heap × 1.5 + 1Gi overhead, plus any cache you size explicitly. For AEM author in production, we run 12Gi containers with -Xmx6g -Xms6g and a 2Gi Oak segment cache: 6Gi × 1.5 + 1Gi = 10Gi, plus 2Gi for the segment cache.
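
As a container spec fragment, the production figures above might look like the sketch below. JAVA_TOOL_OPTIONS is a standard variable the HotSpot JVM reads at startup; the image name is a placeholder, and setting the memory request equal to the limit is an assumption, not something stated above.

```yaml
# Fragment of a pod spec; values follow the 12Gi / 6g figures above.
containers:
  - name: aem-author
    image: registry.example.com/aem-author:latest   # placeholder image
    env:
      - name: JAVA_TOOL_OPTIONS      # picked up by the JVM at startup
        value: "-Xms6g -Xmx6g"       # explicit heap, not container ergonomics
    resources:
      requests:
        memory: 12Gi                 # assumption: request equals limit
      limits:
        memory: 12Gi                 # heap plus non-heap and cache headroom
```

If the image’s start script passes its own -Xmx flag, check which setting actually takes effect before relying on the environment variable.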

What worked, what didn’t

Worked: EBS gp3 persistent volumes for JCR, startup probes with generous windows, separate node groups for AEM workloads (memory-optimised EC2 instances), and a Horizontal Pod Autoscaler on the publish tier based on CPU.
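
A CPU-based autoscaler for the publish tier can be sketched as follows. It assumes a StatefulSet named aem-publish (a hypothetical name), CPU requests set on the publish containers, and the metrics server running in the cluster; the replica bounds, utilisation target, and stabilisation window are illustrative numbers, not measured values.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aem-publish
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: aem-publish                # hypothetical workload name
  minReplicas: 2                     # illustrative bounds
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # percent of the CPU request
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # avoid churning pods that take minutes to boot
```

Because a new publish pod needs minutes to start, scaling reacts slowly; a long scale-down window keeps warmed-up pods around through short dips in traffic.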

Didn’t work: Trying to run author as more than one replica (Oak’s file-based segment store allows only one process to write to a repository), using EFS for the primary Oak store (latency too high for random I/O patterns), relying on default JVM ergonomics inside containers (always set explicit heap sizes).
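
Putting the author constraints together, a single-replica StatefulSet pinned to a dedicated node group might look like this sketch. The workload=aem node label and taint, the image name, and the mount path are assumptions for illustration; the claim name is the PVC defined earlier.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: aem-author
spec:
  serviceName: aem-author            # headless Service, assumed to exist
  replicas: 1                        # author is single-writer; never scale this up
  selector:
    matchLabels:
      app: aem-author
  template:
    metadata:
      labels:
        app: aem-author
    spec:
      nodeSelector:
        workload: aem                # hypothetical label on the AEM node group
      tolerations:
        - key: workload              # hypothetical taint keeping other pods off
          operator: Equal
          value: aem
          effect: NoSchedule
      containers:
        - name: aem-author
          image: registry.example.com/aem-author:latest    # placeholder image
          ports:
            - containerPort: 4502
          volumeMounts:
            - name: repository
              mountPath: /opt/aem/crx-quickstart/repository # assumed install path
      volumes:
        - name: repository
          persistentVolumeClaim:
            claimName: aem-author-repository                # PVC from the storage section
```

A StatefulSet replaces its pod only after the old one has terminated, so two author processes never hold the same volume at once.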

AEM on Kubernetes is absolutely viable in production. It just requires understanding both the JVM runtime and the Kubernetes primitives deeply enough to bridge them correctly.
