KEDA: Event-Driven Autoscaling for Kubernetes

Alberto Savarese


Introduction: Why KEDA Is Relevant Today


In recent years, Kubernetes has standardized autoscaling through the Horizontal Pod Autoscaler (HPA), mainly based on CPU and memory metrics. This approach works well for synchronous, CPU-bound workloads, but it shows clear limitations in event-driven architectures — such as queue consumers, stream processors, asynchronous jobs, and integrations with external systems.

In these scenarios, the true signal of load is not CPU utilization, but the work backlog: queued messages, Kafka lag, pending events, and application metrics. This is where KEDA comes into play: a Kubernetes-native component that enables autoscaling driven by real events, including scale-to-zero — something impossible with pure HPA.

This article is designed for DevOps engineers, Platform Engineers, and SREs who want to deeply understand KEDA and implement it correctly in production, avoiding common mistakes and understanding architectural trade-offs.

What KEDA Is — and What It Is Not

KEDA (Kubernetes Event-Driven Autoscaling) is a Kubernetes operator that extends HPA by providing external metrics derived from event systems.

It is important to clarify a few points immediately:

  • KEDA does not replace HPA → it extends it.
  • KEDA does not execute workloads → it decides when and how much to scale.
  • KEDA is not an event source → it observes existing event sources.

The key value of KEDA is its ability to:

  • scale from 0 to N pods,
  • base scaling on real event sources (RabbitMQ, Kafka, Prometheus, cron, etc.),
  • integrate natively with the Kubernetes API.

KEDA and GenAI Workloads: Why It Matters for LLMs on Kubernetes

KEDA’s value becomes even more evident when observing next-generation workloads, particularly those related to Generative AI and Large Language Models (LLMs).

As recent literature on deploying GenAI systems on Kubernetes highlights, these workloads often exhibit highly irregular patterns: long periods of inactivity followed by sudden bursts of high-compute inference requests.

In such scenarios, traditional metrics like CPU or memory do not represent real workload in a timely or reliable way.

For LLM services, the true signal of pressure is not resource utilization after the fact, but incoming demand — inference requests, asynchronous job queues, pipelines orchestrating retrieval, embedding, and generation.

Scaling only when CPUs or GPUs are already saturated means reacting too late, with direct impacts on latency and cost.

KEDA enables a different approach: it allows scaling of components in a GenAI architecture based on the events that generate work, not on the consumption of resources that execute it. This makes it possible to:

  • anticipate load spikes,
  • reduce over-provisioning of expensive resources (such as GPUs),
  • scale inference services to zero when they are not used.

In this sense, KEDA is not just an autoscaler for event-driven microservices, but an architectural enabler for efficient execution of LLMs on Kubernetes, especially when combined with HPA and cluster autoscaling.

This same principle — scaling on real demand rather than resource utilization — underlies KEDA’s architecture and its integration with HPA.

Architecture of KEDA (How It Really Works)

Architecture overview — source: https://keda.sh/docs/2.18/concepts/

KEDA’s architecture is intentionally simple and aligned with Kubernetes patterns.

Main Components

  1. Scaler
    • Each scaler is responsible for a specific event source.
    • Examples include RabbitMQ, Kafka, Azure Service Bus, Prometheus, and Cron.
    • It retrieves a semantic metric (for example, queue length or consumer lag).
  2. Controller
    • Watches the Custom Resources (ScaledObject, ScaledJob).
    • Decides when to enable or disable scaling.
  3. Metrics Adapter
    • Exposes metrics through the External Metrics API.
    • Allows the HPA to consume them as if they were native Kubernetes metrics.
  4. Admission Webhook
    • Validates configurations (CRDs).
    • Prevents structural errors at runtime.

Integration with HPA

The actual scaling flow works as follows:

  1. You define a ScaledObject.
  2. KEDA automatically creates an HPA.
  3. KEDA manages:
    • scaling 0 → 1
    • scaling 1 → 0
  4. HPA manages:
    • scaling 1 → N

This design avoids reinventing autoscaling and preserves full compatibility with the Kubernetes ecosystem.
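
Once a ScaledObject is applied, the generated HPA is visible like any other HPA object. As a quick sanity check — in current KEDA versions the HPA is named keda-hpa-<scaledobject-name>:

kubectl get hpa -n <your-namespace>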

ScaledObject and ScaledJob: Choosing the Right Tool

ScaledObject

Use it when:

  • the workload is long-running,
  • it scales a Deployment or StatefulSet,
  • you want consumers always ready when events arrive.

Features:

  • supports scale to zero,
  • dynamic replica count,
  • perfect for microservices and consumers.

ScaledJob

Use it when:

  • each event is an independent unit of work,
  • you want real Kubernetes Jobs,
  • the pod must terminate after processing.

Features:

  • 1 event → 1 Job,
  • ideal for batch processing, ETL, asynchronous tasks,
  • automatic cleanup of completed jobs.

Key Trade-off:

ScaledJob provides perfect isolation but increases scheduling overhead.
ScaledObject is more efficient for continuous workloads.
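
To make the ScaledJob side concrete, here is a minimal sketch — the image name and queue are hypothetical, and authentication is omitted for brevity (it is covered later in this article):

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: order-processor
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: processor
          image: ghcr.io/example/order-processor:latest   # hypothetical image
        restartPolicy: Never
  pollingInterval: 30              # how often KEDA checks the queue
  maxReplicaCount: 20              # cap on concurrent Jobs
  successfulJobsHistoryLimit: 3    # automatic cleanup of finished Jobs
  failedJobsHistoryLimit: 3
  triggers:
  - type: rabbitmq
    metadata:
      queueName: orders
      mode: QueueLength
      value: "1"                   # roughly one Job per queued message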

Installing KEDA

Recommended Method: Helm

helm repo add kedacore https://kedacore.github.io/charts

helm repo update

kubectl create namespace keda

helm install keda kedacore/keda --namespace keda
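
A quick way to verify the installation — exact pod names can vary slightly by chart version, but you should see the operator, the metrics server, and the admission webhooks running:

kubectl get pods -n keda

kubectl get crd | grep keda.sh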

Why Helm in Production

  • controlled versioning,
  • easy upgrades and rollbacks,
  • clean CRD management.

Alternative: YAML Manifests

Useful for quick tests or extremely controlled environments:

kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.18.2/keda-2.18.2.yaml

Anatomy of a ScaledObject (Understanding It Line by Line)

Core Parameters

FIELD              WHY IT MATTERS
scaleTargetRef     Links KEDA to the workload
pollingInterval    Event polling frequency
cooldownPeriod     Scaling stability
minReplicaCount    Enables scale to zero
maxReplicaCount    Protection against runaway scaling
idleReplicaCount   Replica count when no events arrive (only 0 is currently supported)
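
Putting the table together, a minimal ScaledObject might look like the sketch below. Names are hypothetical, and the rabbitmq-trigger-auth identity is defined in the authentication section later in this article:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-consumer-scaler
spec:
  scaleTargetRef:
    name: rabbitmq-consumer        # the Deployment to scale
  pollingInterval: 30              # seconds between event-source checks
  cooldownPeriod: 300              # inactivity window before scaling 1 → 0
  minReplicaCount: 0               # enables scale to zero
  maxReplicaCount: 20              # protection against runaway scaling
  triggers:
  - type: rabbitmq
    metadata:
      queueName: queue001
      mode: QueueLength
      value: "5"                   # target messages per replica
    authenticationRef:
      name: rabbitmq-trigger-auth  # defined later in this article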

Advanced Section

This section controls fine-grained scaling behavior.

restoreToOriginalReplicaCount

  • true: when the ScaledObject is deleted, the workload returns to the replica count it had before KEDA managed it
  • false (default): the workload keeps its current replica count

horizontalPodAutoscalerConfig.behavior

  • avoids aggressive oscillations,
  • allows immediate scale-up and controlled scale-down.

In production, configuring behavior is strongly recommended.
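
As a hedged sketch, a behavior block that allows immediate scale-up but smooths scale-down could look like this — the values are illustrative starting points, not universal defaults:

spec:
  advanced:
    restoreToOriginalReplicaCount: false
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0    # react immediately to new load
        scaleDown:
          stabilizationWindowSeconds: 300  # wait 5 minutes before shrinking
          policies:
          - type: Percent
            value: 50                      # remove at most 50% of pods
            periodSeconds: 60              # per minute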

Real-World Example: Autoscaling a RabbitMQ Consumer

Goal

  • Java consumer running on Kubernetes
  • RabbitMQ as the event source
  • Automatic scaling from 0 → N → 0

Scaling Logic

KEDA calculates:

desiredReplicas = ceil(queueLength / targetValue)

Example:

  • 100 messages in the queue
  • target value = 5
  • → ceil(100 / 5) = 20 pods

This approach:

  • maximizes parallelism,
  • avoids over-scaling,
  • preserves operational predictability.

Why Cooldown Is Essential

Without a cooldownPeriod:

  • continuous scale-up / scale-down cycles,
  • instability,
  • unnecessary costs.

Authentication in KEDA: A Structural Concern, Not a Detail

When KEDA is introduced into a Kubernetes cluster, autoscaling is no longer a purely “internal” cluster concern.

By definition, KEDA scalers must communicate with external systems: message brokers, databases, cloud APIs, and metrics exposed outside the cluster.

This fundamentally changes the security model.

With Classic HPA:

  • metrics are local (CPU, memory),
  • no application credentials are involved.

With KEDA:

  • each scaler must authenticate against an external system,
  • those credentials become part of the scaling control plane.

Treating authentication as a configuration detail is one of the most common — and most dangerous — mistakes when running KEDA in production.

Core Principle: Separate Scaling, Workload, and Secrets

KEDA intentionally introduces an indirection layer between:

  • the workload being scaled (Deployment / Job),
  • the scaling logic (ScaledObject / ScaledJob),
  • the credentials required to read metrics.

This principle is critical for three reasons:

  1. Security
    • No credentials hardcoded in application YAMLs.
  2. Maintainability
    • Rotating passwords or tokens does not require workload changes.
  3. Observability and Audit
    • It is clear who accesses what and why.

This is where Secret, TriggerAuthentication, and authenticationRef come into play.

Authentication Flow in KEDA

Conceptually, the flow is:

  1. KEDA reads the ScaledObject
  2. Finds an authenticationRef
  3. Resolves it to a TriggerAuthentication
  4. The TriggerAuthentication:
    • reads one or more Kubernetes Secrets (or an external secret)
  5. Credentials are used only by the scaler
  6. The scaled Deployment is unaware of this process

This is a crucial point: the scaled pod does not need to know the credentials used for scaling.

The reference pattern for real environments combines three objects: a Kubernetes Secret, a TriggerAuthentication, and an authenticationRef inside the ScaledObject.

1. Kubernetes Secret

The Secret contains only sensitive data, no logic:

  • host
  • username
  • password
  • connection string
  • token

apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-secret
type: Opaque
data:
  host: <base64>
  username: <base64>
  password: <base64>

The Secret lives in the application namespace, follows RBAC rules, and can be rotated without touching KEDA.
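
In practice you rarely hand-encode base64; kubectl does it for you. A sketch with hypothetical values:

kubectl create secret generic rabbitmq-secret \
  --from-literal=host=amqp://rabbitmq.default.svc:5672/vhost \
  --from-literal=username=consumer \
  --from-literal=password='changeme'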

2. TriggerAuthentication

TriggerAuthentication maps Secrets to the parameters expected by the scaler.

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-trigger-auth
spec:
  secretTargetRef:
  - parameter: host
    name: rabbitmq-secret
    key: host
  - parameter: username
    name: rabbitmq-secret
    key: username
  - parameter: password
    name: rabbitmq-secret
    key: password

This is the most important part:

  • parameter is not arbitrary,
  • it must exactly match what the RabbitMQ scaler expects.

This makes authentication explicit, typed, and verifiable.

3. ScaledObject with authenticationRef

The ScaledObject simply states:

“For this trigger, use that identity”.

triggers:
- type: rabbitmq
  metadata:
    queueName: queue001
    mode: QueueLength
    value: "5"
  authenticationRef:
    name: rabbitmq-trigger-auth

The result is a clean separation:

  • ScaledObject → scaling logic
  • TriggerAuthentication → identity
  • Secret → sensitive data 


Why Avoid Inline Authentication (Except for Demos)

KEDA also supports faster configurations, such as placing values directly in trigger metadata or reading them from the target container's environment variables (the ...FromEnv fields).

Useful for tests and PoCs, but with clear limitations:

  • couples scaling logic and credentials,
  • complicates credential rotation,
  • reduces readability of the flow.

In shared or regulated environments, this approach does not scale organizationally.

ClusterTriggerAuthentication: Multi-Namespace Scaling

In enterprise environments it is common to have:

  • multiple namespaces,
  • multiple teams,
  • a single shared broker or external service.

Standard TriggerAuthentication is namespace-scoped.
To avoid duplication, KEDA introduces ClusterTriggerAuthentication.

OBJECT                         SCOPE
TriggerAuthentication          Namespace
ClusterTriggerAuthentication   Cluster-wide

Advantages:

  • a single identity definition,
  • reusable across multiple ScaledObjects.

Note that Secrets referenced by a ClusterTriggerAuthentication must live in the namespace where KEDA itself is installed (keda by default), which centralizes shared credentials rather than duplicating them across application namespaces.
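
A minimal sketch — the object is identical to a TriggerAuthentication except for being cluster-scoped, and the consuming ScaledObject must declare the kind explicitly:

apiVersion: keda.sh/v1alpha1
kind: ClusterTriggerAuthentication
metadata:
  name: shared-rabbitmq-auth       # cluster-scoped: no namespace
spec:
  secretTargetRef:
  - parameter: host
    name: rabbitmq-secret          # must live in KEDA's own namespace
    key: host

The trigger then references it by kind:

authenticationRef:
  name: shared-rabbitmq-auth
  kind: ClusterTriggerAuthentication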

External Secrets: HashiCorp Vault and Beyond

For high-compliance environments (PCI, SOC2, regulated workloads), KEDA can read secrets directly from external systems such as HashiCorp Vault.

In this scenario:

  • KEDA does not read Kubernetes Secrets,
  • it queries Vault using an authentication method (token, Kubernetes auth, etc.),
  • it dynamically resolves scaler parameters.

The trade-off is clear:

  • more security
  • more operational complexity

This is an architectural choice, not a universal requirement.
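
For illustration, a TriggerAuthentication backed by Vault via the Kubernetes auth method might look like the sketch below — the address, role, and secret path are assumptions about your environment:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: vault-trigger-auth
spec:
  hashiCorpVault:
    address: https://vault.example.com:8200
    authentication: kubernetes
    role: keda                      # Vault role bound to KEDA's service account
    mount: kubernetes
    credential:
      serviceAccount: /var/run/secrets/kubernetes.io/serviceaccount/token
    secrets:
    - parameter: password           # the scaler parameter to populate
      key: password
      path: secret/data/keda/rabbitmq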


Available Scalers and When to Use Them

One of KEDA’s strengths is the abstraction of the event concept.
What matters is not where the load signal comes from, but how accurately it represents real work.

For this reason, KEDA supports many scalers, each designed to translate a specific type of event or metric into a coherent scaling decision.

Queue-Based Scalers

(RabbitMQ, AWS SQS, Azure Queue)

Ideal when message backlog directly represents system load.
The number of queued messages is a clear and immediate measure of work to be processed, making these scalers perfect for asynchronous consumers and integration pipelines.
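
For example, a hedged AWS SQS trigger (queue URL and region are placeholders):

triggers:
- type: aws-sqs-queue
  metadata:
    queueURL: https://sqs.eu-west-1.amazonaws.com/123456789012/orders
    queueLength: "5"          # target messages per replica
    awsRegion: eu-west-1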

Stream-Based Scalers

(Kafka, EventHub, Pub/Sub)

Designed for high-velocity, high-throughput systems where consumer lag is the real pressure indicator.
CPU and memory are secondary metrics; accumulated stream lag drives scaling decisions.
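
A typical Kafka trigger scales on consumer-group lag (broker, group, and topic names are illustrative):

triggers:
- type: kafka
  metadata:
    bootstrapServers: kafka.messaging.svc:9092
    consumerGroup: payments-consumer
    topic: payments
    lagThreshold: "50"        # target lag per replica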

Metric-Based Scalers

(Prometheus, Datadog)

Useful when load is expressed via aggregated application metrics such as request count, response time, or custom counters.
Effective when no explicit queue exists but work can be represented by observable metrics.
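
A sketch of a Prometheus trigger scaling on request rate — the query is an assumption about your metric names:

triggers:
- type: prometheus
  metadata:
    serverAddress: http://prometheus.monitoring.svc:9090
    query: sum(rate(http_requests_total{app="api"}[2m]))
    threshold: "100"          # target requests/sec per replica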

Time-Based Scalers

(Cron)

Perfect for predictable workloads or controlled operating windows.
They allow scaling based on time, such as increasing capacity during known peak hours or reducing it during scheduled idle periods.
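
For example, a cron trigger holding ten replicas during business hours (timezone and window are illustrative):

triggers:
- type: cron
  metadata:
    timezone: Europe/Rome
    start: 0 8 * * *          # scale up at 08:00
    end: 0 20 * * *           # scale back at 20:00
    desiredReplicas: "10"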

Database Scalers

(Redis, PostgreSQL, MySQL)

Useful when application state resides in data structures or tables.
The number of records in a specific state (for example, “pending” or “queued”) becomes the signal driving scaling, keeping the system aligned with actual work.
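
A hedged PostgreSQL trigger counting pending rows — the connection details and query are assumptions, and the password comes from a TriggerAuthentication:

triggers:
- type: postgresql
  metadata:
    host: postgres.db.svc.cluster.local
    port: "5432"
    userName: app
    dbName: orders
    sslmode: require
    query: "SELECT count(*) FROM jobs WHERE status = 'pending'"
    targetQueryValue: "10"    # target pending rows per replica
  authenticationRef:
    name: postgres-trigger-auth   # hypothetical, supplies the password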

In KEDA, scaling is not driven by CPU usage, but by the operational truth of the system:
if a signal is measurable and represents real work, it can drive scaling.


Example of application-metric-based autoscaling: Prometheus exposes external metrics that are consumed by KEDA and the HPA. Source: https://medium.com/building-inventa/scaling-pods-based-on-prometheus-metrics-using-keda-64686565ed11

Common Mistakes to Avoid

1. Using KEDA as a Full Replacement for HPA

KEDA is not a replacement for HPA, but a metrics provider that extends its capabilities.
In many scenarios, especially CPU-bound or memory-bound workloads, HPA remains the most stable and predictable solution.

Best practice:
Use KEDA to translate external events (queues, streams, topics) into Kubernetes metrics and let HPA handle the final scaling decision.
This hybrid approach reduces erratic behavior and improves stability.
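
KEDA also provides cpu and memory scalers, so a single ScaledObject can blend resource-based and event-based signals. A sketch of this hybrid setup — note that adding a cpu or memory trigger makes scale-to-zero impossible:

triggers:
- type: cpu
  metricType: Utilization
  metadata:
    value: "70"               # classic HPA-style CPU target
- type: kafka
  metadata:
    bootstrapServers: kafka.messaging.svc:9092
    consumerGroup: payments-consumer
    topic: payments
    lagThreshold: "50"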

2. Setting an Overly Aggressive Polling Interval

A pollingInterval that is too low can:

  • increase load on external APIs (Kafka, Azure Service Bus, Prometheus),
  • introduce latency and instability in scaling,
  • generate unnecessary costs or throttling.

Best practice:
Start with conservative values (30–60 seconds) and reduce them only after observing real latency and throughput metrics.
Event-driven scaling should be timely, not hyper-reactive.

3. Not Defining maxReplicaCount

Leaving scaling without an upper bound is one of the most dangerous production mistakes:

  • uncontrolled over-scaling,
  • node and cluster saturation,
  • unexpected costs.

Best practice:
Always set maxReplicaCount based on cluster capacity, event source limits, and maximum sustainable application load.

4. Hardcoding Secrets and Credentials in YAML

Embedding credentials directly in Kubernetes manifests:

  • violates security best practices,
  • exposes secrets in repositories,
  • complicates credential rotation.

Best practice:
Use TriggerAuthentication with Kubernetes Secrets or external providers (Vault, Azure Key Vault, AWS Secrets Manager).

5. Ignoring Scale-Down Behavior

Many teams focus only on scale-up, but scale-down is often the most delicate phase:

  • overly aggressive scale-down causes frequent cold starts,
  • stateful or async cleanup workloads may lose in-flight tasks or messages.

Best practice:
Configure cooldownPeriod carefully, test behavior with intermittent workloads, and ensure the application handles shutdown and retries idempotently.

Conclusion: When KEDA Really Makes the Difference

KEDA is a powerful but targeted tool.
It should not be used everywhere, but where workloads are event-driven it delivers significant benefits.

Practical Takeaways

  • Use KEDA when CPU ≠ work
  • Always combine it with HPA
  • Design scale-down before scale-up
  • Treat secrets as first-class citizens
  • Start simple, refine with advanced configuration

If Kubernetes is the operating system of the cloud, KEDA is what makes it responsive to the real world.