ClawSkills logoClawSkills

Kubernetes

Comprehensive Kubernetes and OpenShift cluster management skill covering operations, troubleshooting, manifest generation, security, and GitOps. Use this skill

Introduction

# Kubernetes & OpenShift Cluster Management

Comprehensive skill for Kubernetes and OpenShift clusters covering operations, troubleshooting, manifests, security, and GitOps.

## Current Versions (January 2026)

| Platform | Version | Documentation | |----------|---------|---------------| | **Kubernetes** | 1.31.x | https://kubernetes.io/docs/ | | **OpenShift** | 4.17.x | https://docs.openshift.com/ | | **EKS** | 1.31 | https://docs.aws.amazon.com/eks/ | | **AKS** | 1.31 | https://learn.microsoft.com/azure/aks/ | | **GKE** | 1.31 | https://cloud.google.com/kubernetes-engine/docs |

### Key Tools

| Tool | Version | Purpose | |------|---------|---------| | **ArgoCD** | v2.13.x | GitOps deployments | | **Flux** | v2.4.x | GitOps toolkit | | **Kustomize** | v5.5.x | Manifest customization | | **Helm** | v3.16.x | Package management | | **Velero** | 1.15.x | Backup/restore | | **Trivy** | 0.58.x | Security scanning | | **Kyverno** | 1.13.x | Policy engine |

## Command Convention

**IMPORTANT**: Use `kubectl` for standard Kubernetes. Use `oc` for OpenShift/ARO.

---

## 1. CLUSTER OPERATIONS

### Node Management

```bash # View nodes kubectl get nodes -o wide

# Drain node for maintenance kubectl drain ${NODE} --ignore-daemonsets --delete-emptydir-data --grace-period=60

# Uncordon after maintenance kubectl uncordon ${NODE}

# View node resources kubectl top nodes ```

### Cluster Upgrades

**AKS:** ```bash az aks get-upgrades -g ${RG} -n ${CLUSTER} -o table az aks upgrade -g ${RG} -n ${CLUSTER} --kubernetes-version ${VERSION} ```

**EKS:** ```bash aws eks update-cluster-version --name ${CLUSTER} --kubernetes-version ${VERSION} ```

**GKE:** ```bash gcloud container clusters upgrade ${CLUSTER} --master --cluster-version ${VERSION} ```

**OpenShift:** ```bash oc adm upgrade --to=${VERSION} oc get clusterversion ```

### Backup with Velero

```bash # Install Velero velero install --provider ${PROVIDER} --bucket ${BUCKET} --secret-file ${CREDS}

# Create backup velero backup create ${BACKUP_NAME} --include-namespaces ${NS}

# Restore velero restore create --from-backup ${BACKUP_NAME} ```

---

## 2. TROUBLESHOOTING

### Health Assessment

Run the bundled script for comprehensive health check: ```bash bash scripts/cluster-health-check.sh ```

### Pod Status Interpretation

| Status | Meaning | Action | |--------|---------|--------| | `Pending` | Scheduling issue | Check resources, nodeSelector, tolerations | | `CrashLoopBackOff` | Container crashing | Check logs: `kubectl logs ${POD} --previous` | | `ImagePullBackOff` | Image unavailable | Verify image name, registry access | | `OOMKilled` | Out of memory | Increase memory limits | | `Evicted` | Node pressure | Check node resources |

### Debugging Commands

```bash # Pod logs (current and previous) kubectl logs ${POD} -c ${CONTAINER} --previous

# Multi-pod logs with stern stern ${LABEL_SELECTOR} -n ${NS}

# Exec into pod kubectl exec -it ${POD} -- /bin/sh

# Pod events kubectl describe pod ${POD} | grep -A 20 Events

# Cluster events (sorted by time) kubectl get events -A --sort-by='.lastTimestamp' | tail -50 ```

### Network Troubleshooting

```bash # Test DNS kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default

# Test service connectivity kubectl run -it --rm debug --image=curlimages/curl -- curl -v http://${SVC}.${NS}:${PORT}

# Check endpoints kubectl get endpoints ${SVC} ```

---

## 3. MANIFEST GENERATION

### Production Deployment Template

```yaml apiVersion: apps/v1 kind: Deployment metadata: name: ${APP_NAME} namespace: ${NAMESPACE} labels: app.kubernetes.io/name: ${APP_NAME} app.kubernetes.io/version: "${VERSION}" spec: replicas: 3 strategy: type: RollingUpdate rollingUpdate: maxSurge: 1 maxUnavailable: 0 selector: matchLabels: app.kubernetes.io/name: ${APP_NAME} template: metadata: labels: app.kubernetes.io/name: ${APP_NAME} spec: serviceAccountName: ${APP_NAME} securityContext: runAsNonRoot: true runAsUser: 1000 fsGroup: 1000 seccompProfile: type: RuntimeDefault containers: - name: ${APP_NAME} image: ${IMAGE}:${TAG} ports: - name: http containerPort: 8080 securityContext: allowPrivilegeEscalation: false readOnlyRootFilesystem: true capabilities: drop: ["ALL"] resources: requests: cpu: 100m memory: 128Mi limits: cpu: 500m memory: 512Mi livenessProbe: httpGet: path: /healthz port: http initialDelaySeconds: 10 periodSeconds: 10 readinessProbe: httpGet: path: /ready port: http initialDelaySeconds: 5 periodSeconds: 5 volumeMounts: - name: tmp mountPath: /tmp volumes: - name: tmp emptyDir: {} affinity: podAntiAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 100 podAffinityTerm: labelSelector: matchLabels: app.kubernetes.io/name: ${APP_NAME} topologyKey: kubernetes.io/hostname ```

### Service & Ingress

```yaml apiVersion: v1 kind: Service metadata: name: ${APP_NAME} spec: selector: app.kubernetes.io/name: ${APP_NAME} ports: - name: http port: 80 targetPort: http --- apiVersion: networking.k8s.io/v1 kind: Ingress metadata: name: ${APP_NAME} annotations: nginx.ingress.kubernetes.io/ssl-redirect: "true" spec: ingressClassName: nginx tls: - hosts: - ${HOST} secretName: ${APP_NAME}-tls rules: - host: ${HOST} http: paths: - path: / pathType: Prefix backend: service: name: ${APP_NAME} port: name: http ```

### OpenShift Route

```yaml apiVersion: route.openshift.io/v1 kind: Route metadata: name: ${APP_NAME} spec: to: kind: Service name: ${APP_NAME} port: targetPort: http tls: termination: edge insecureEdgeTerminationPolicy: Redirect ```

Use the bundled script for manifest generation: ```bash bash scripts/generate-manifest.sh deployment myapp production ```

---

## 4. SECURITY

### Security Audit

Run the bundled script: ```bash bash scripts/security-audit.sh [namespace] ```

### Pod Security Standards

```yaml apiVersion: v1 kind: Namespace metadata: name: ${NAMESPACE} labels: pod-security.kubernetes.io/enforce: restricted pod-security.kubernetes.io/audit: baseline pod-security.kubernetes.io/warn: restricted ```

### NetworkPolicy (Zero Trust)

```yaml apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: ${APP_NAME}-policy spec: podSelector: matchLabels: app.kubernetes.io/name: ${APP_NAME} policyTypes: - Ingress - Egress ingress: - from: - podSelector: matchLabels: app.kubernetes.io/name: frontend ports: - protocol: TCP port: 8080 egress: - to: - podSelector: matchLabels: app.kubernetes.io/name: database ports: - protocol: TCP port: 5432 # Allow DNS - to: - namespaceSelector: {} podSelector: matchLabels: k8s-app: kube-dns ports: - protocol: UDP port: 53 ```

### RBAC Best Practices

```yaml apiVersion: v1 kind: ServiceAccount metadata: name: ${APP_NAME} --- apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: ${APP_NAME}-role rules: - apiGroups: [""] resources: ["configmaps"] verbs: ["get", "list"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: ${APP_NAME}-binding subjects: - kind: ServiceAccount name: ${APP_NAME} roleRef: apiGroup: rbac.authorization.k8s.io kind: Role name: ${APP_NAME}-role ```

### Image Scanning

```bash # Scan image with Trivy trivy image ${IMAGE}:${TAG}

# Scan with severity filter trivy image --severity HIGH,CRITICAL ${IMAGE}:${TAG}

# Generate SBOM trivy image --format spdx-json -o sbom.json ${IMAGE}:${TAG} ```

---

## 5. GITOPS

### ArgoCD Application

```yaml apiVersion: argoproj.io/v1alpha1 kind: Application metadata: name: ${APP_NAME} namespace: argocd finalizers: - resources-finalizer.argocd.argoproj.io spec: project: default source: repoURL: ${GIT_REPO} targetRevision: main path: k8s/overlays/${ENV} destination: server: https://kubernetes.default.svc namespace: ${NAMESPACE} syncPolicy: automated: prune: true selfHeal: true syncOptions: - CreateNamespace=true ```

### Kustomize Structure

``` k8s/ ├── base/ │ ├── kustomization.yaml │ ├── deployment.yaml │ └── service.yaml └── overlays/ ├── dev/ │ └── kustomization.yaml ├── staging/ │ └── kustomization.yaml └── prod/ └── kustomization.yaml ```

**base/kustomization.yaml:** ```yaml apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization resources: - deployment.yaml - service.yaml ```

**overlays/prod/kustomization.yaml:** ```yaml apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization resources: - ../../base namePrefix: prod- namespace: production replicas: - name: myapp count: 5 images: - name: myregistry/myapp newTag: v1.2.3 ```

### GitHub Actions CI/CD

```yaml name: Build and Deploy on: push: branches: [main]

jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Build and push image uses: docker/build-push-action@v5 with: push: true tags: ${{ secrets.REGISTRY }}/${{ github.event.repository.name }}:${{ github.sha }} - name: Update Kustomize image run: | cd k8s/overlays/prod kustomize edit set image myapp=${{ secrets.REGISTRY }}/${{ github.event.repository.name }}:${{ github.sha }} - name: Commit and push run: | git config user.name "github-actions" git config user.email "[email protected]" git add . git commit -m "Update image to ${{ github.sha }}" git push ```

Use the bundled script for ArgoCD sync: ```bash bash scripts/argocd-app-sync.sh ${APP_NAME} --prune ```

---

## Helper Scripts

This skill includes automation scripts in the `scripts/` directory:

| Script | Purpose | |--------|---------| | `cluster-health-check.sh` | Comprehensive cluster health assessment with scoring | | `security-audit.sh` | Security posture audit (privileged, root, RBAC, NetworkPolicy) | | `node-maintenance.sh` | Safe node drain and maintenance prep | | `pre-upgrade-check.sh` | Pre-upgrade validation checklist | | `generate-manifest.sh` | Generate production-ready K8s manifests | | `argocd-app-sync.sh` | ArgoCD application sync helper |

Run any script: ```bash bash scripts/<script-name>.sh [arguments] ```

More Products