Operating

Day-2 guidance for running the operator in production.

The operator surfaces its state in three places.

Logs. Structured logs, controlled by RUST_LOG (default info,stepscale_operator=debug). Each reconcile tick logs the license state, whether this replica is the leader, and the metrics source in use:

kubectl logs -n <namespace> deploy/<release>-stepscale-autoscaler -f

Useful log lines to watch for:

| Log message (substring) | Meaning |
| --- | --- |
| operator starting | Startup; reports provider, namespaces, Prometheus, forecasting, licensed. |
| metrics source: Prometheus history backfill | Prometheus is wired correctly. |
| metrics source: HPA-status fallback | No Prometheus URL - history rebuilds slowly. |
| reconcile tick | A tick ran; includes license=… and leader=…. |
| created ScalingRecommendation | A new recommendation was emitted. |
| applied recommendation / auto-reverted degraded recommendation | Apply / rollback occurred. |
| not leader; skipping mutating pass | This replica is a follower (expected with replicaCount > 1). |
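
To watch only the event-like lines from the table above, the log stream can be piped through a small filter. This is a minimal sketch: the function name `filter_operator_logs` is hypothetical, and it assumes the substrings appear verbatim in each log line. The per-tick `reconcile tick` and follower messages are left out on purpose to reduce noise.

```shell
# Keep only startup, metrics-source, and recommendation events.
# --line-buffered makes matches appear immediately when following with -f.
filter_operator_logs() {
  grep --line-buffered -E \
    'operator starting|metrics source:|created ScalingRecommendation|applied recommendation|auto-reverted degraded recommendation'
}

# Usage, with the same placeholders as the kubectl logs command above:
#   kubectl logs -n <namespace> deploy/<release>-stepscale-autoscaler -f | filter_operator_logs
```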

Recommendation status. The authoritative state of any change is the CR’s status.phase and status.detail (see Usage §6.5):

kubectl get scalerec -A
kubectl get scalerec <name> -n <namespace> -o jsonpath='{.status}{"\n"}'
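
For a cluster-wide overview, the same two status fields can be shown as columns. A minimal sketch: the helper name `scalerec_phases` and the column headers are arbitrary, and the output depends on the operator having populated `status.phase` and `status.detail`.

```shell
# List every recommendation with its phase and detail in one table.
scalerec_phases() {
  kubectl get scalerec -A \
    -o 'custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,DETAIL:.status.detail'
}

# Usage:
#   scalerec_phases
```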

Kubernetes resources. Inspect the live workload to confirm an applied change:

kubectl get hpa <name> -n <namespace> \
  -o 'custom-columns=MIN:.spec.minReplicas,MAX:.spec.maxReplicas,TARGET:.spec.metrics[0].resource.target.averageUtilization'

Run two or more replicas and keep leader election enabled (the default):

helm upgrade <release> oci://ghcr.io/stepscale/charts/stepscale-autoscaler \
--version <version> --namespace <namespace> --reuse-values \
--set replicaCount=2
  • Leader election uses a coordination.k8s.io Lease (leaderElection.leaseName, default stepscale-autoscaler-leader). Only the leader runs the mutating passes (apply, verify, schedule), so multiple replicas never double-apply.
  • The lease duration is 3× intervalSeconds. Each mutating pass is re-gated on a fresh leadership check and bounded by a time budget, so a replica that loses the lease stops mutating promptly during a failover.
  • Followers still run read-only analysis, so failover is fast - a standby is already warm.
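
As a rough check of the points above, the expected lease duration can be computed from intervalSeconds and compared with what is recorded on the Lease. This is a sketch under assumptions: `lease_holder` is a hypothetical helper, and it assumes the operator fills in the standard Lease fields `spec.holderIdentity` and `spec.leaseDurationSeconds`.

```shell
# Expected lease duration: 3 x intervalSeconds.
interval_seconds=300                      # default intervalSeconds
lease_duration=$((3 * interval_seconds))
echo "expected lease duration: ${lease_duration}s"   # 900s for the default

# Print the current holder and the duration stored on the Lease.
# $1 = namespace, $2 = lease name (falls back to the default leaseName)
lease_holder() {
  kubectl get lease "${2:-stepscale-autoscaler-leader}" -n "$1" \
    -o jsonpath='{.spec.holderIdentity}{"\t"}{.spec.leaseDurationSeconds}{"\n"}'
}

# Usage:
#   lease_holder <namespace>
```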

Upgrades follow the same verify-then-install flow as the initial install:

  1. Verify the new image signature (see Installation §3.1).

  2. Mirror the new image and chart if you are air-gapped (§3.3).

  3. Upgrade in place:

    helm upgrade <release> oci://ghcr.io/stepscale/charts/stepscale-autoscaler \
    --version <new-version> --namespace <namespace> --reuse-values

The CRD ships with the chart. Existing ScalingRecommendation resources and their approval state are preserved across upgrades. Pull access to new images is tied to an active subscription, which is the renewal lever - see Licensing.
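
After an upgrade, one way to confirm the rollout finished and that existing recommendations are still present is sketched below. The helper name `post_upgrade_check` and the five-minute timeout are illustrative choices; the Deployment name follows the `<release>-stepscale-autoscaler` pattern used in the logs command.

```shell
# Wait for the new operator pods, then list recommendations.
# $1 = release, $2 = namespace
post_upgrade_check() {
  kubectl rollout status "deploy/$1-stepscale-autoscaler" -n "$2" --timeout=5m &&
    kubectl get scalerec -A
}

# Usage:
#   post_upgrade_check <release> <namespace>
```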

| Mode | How to run it (current behavior) |
| --- | --- |
| Analysis-only (advisor) | Install without a license (or without license.publicKey). The operator watches and emits recommendations but never applies; approved recommendations are marked blocked. Equivalently, simply never approve. |
| Apply | Provide a valid license and license.publicKey, then approve recommendations. The operator applies and verifies them. |
| Rules-only (no LLM) | Set llm.provider=none. Analysis uses the deterministic rule engine; no external calls are made. Combine with either mode above. |
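
Switching an existing release to rules-only mode can reuse the upgrade pattern shown earlier. A minimal sketch: `enable_rules_only` is a hypothetical wrapper, and only the `llm.provider=none` setting comes from the table above.

```shell
# Re-apply the release with the LLM provider disabled, keeping other values.
# $1 = release, $2 = namespace, $3 = chart version
enable_rules_only() {
  helm upgrade "$1" oci://ghcr.io/stepscale/charts/stepscale-autoscaler \
    --version "$3" --namespace "$2" --reuse-values \
    --set llm.provider=none
}

# Usage:
#   enable_rules_only <release> <namespace> <version>
```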
  • intervalSeconds trades responsiveness against API-server load. The default 300s is appropriate for most clusters; lower it only if you need faster turnaround and your control plane has headroom.
  • safety.probationWindowMinutes should be long enough to capture a representative traffic sample for the workload. For spiky daily traffic, keep it at or above the default.
  • safety.healthCpuMargin widens or tightens what counts as “degraded.” Raise it to tolerate more post-change CPU headroom before rolling back; lower it for stricter reverts.
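
To put the intervalSeconds trade-off in numbers, the sketch below counts reconcile ticks per hour for a given interval. It assumes one tick per interval per replica, which is a simplification; `ticks_per_hour` is a hypothetical helper, and 60 is only an illustrative lower value.

```shell
# Reconcile ticks per hour for a given intervalSeconds (integer division).
ticks_per_hour() {
  echo $((3600 / $1))
}

ticks_per_hour 300   # default interval: 12 ticks per hour
ticks_per_hour 60    # illustrative lower interval: 60 ticks per hour
```

Going from 300s to 60s multiplies the tick rate by five, and because the lease duration is 3× intervalSeconds it also shortens the lease from 900s to 180s.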