Infrastructure as Code: define and maintain the entire infrastructure in code using Terraform and Ansible; all changes go through code review and CI validation — no manual provisioning.
GitOps implementation: own GitOps workflows with ArgoCD, covering application state management, automated sync, progressive delivery, and rollback strategies across all environments.
Observability stack: build and operate the full observability layer, covering metrics (Prometheus + Grafana), logging (EFK / Loki), distributed tracing (Jaeger / Tempo), and alerting; define SLOs and error budgets.
CI/CD pipeline engineering: build and maintain end-to-end pipelines (build, test, security scan, staging, and production) with zero-downtime deployments using GitLab CI, Tekton, and ArgoCD.
Multi-layer security: implement and maintain network security (firewall rules, NetworkPolicies, ingress/egress controls, VPN tunnels to government systems); container security (image scanning with Trivy/Clair, admission controllers, Pod Security Standards, runtime threat detection with Falco); secrets management (HashiCorp Vault, automated secret rotation); and compliance hardening (CIS Kubernetes Benchmark, DISA STIG).
Security incident response: participate in threat detection, forensic investigation, and post-mortem analysis for security incidents; maintain and rehearse incident response runbooks.
Kubernetes workload management: manage the full lifecycle of 20+ microservices across namespaces, including RBAC, NetworkPolicies, resource quotas, Custom Resource Definitions (CRDs), Operators, and admission webhooks.
Cross-functional collaboration: work closely with Backend and Mobile teams to optimize deployment workflows, resolve infrastructure bottlenecks, and ensure platform capabilities ship reliably end-to-end.
Platform provisioning & operations: deploy and operate production clusters on Red Hat OpenShift (OCP 4.x) or VMware Cloud Foundation 9; configure multi-zone topology, high availability, and disaster recovery; own the platform uptime SLA.
Capacity planning & scaling: lead capacity planning exercises; configure and tune auto-scaling (HPA, VPA, Cluster Autoscaler) to handle traffic spikes without over-provisioning.
Service mesh: design and operate Istio / OpenShift Service Mesh, including enforcing mTLS between services, managing traffic routing, implementing canary deployments, and configuring circuit breaking.