
# Kubernetes & OpenShift Cluster Management

A comprehensive skill for Kubernetes and OpenShift clusters, covering operations, troubleshooting, manifests, security, and GitOps.

## Current Versions (January 2026)

| Platform | Version | Documentation |
|----------|---------|---------------|
| **Kubernetes** | 1.31.x | https://kubernetes.io/docs/ |
| **OpenShift** | 4.17.x | https://docs.openshift.com/ |
| **EKS** | 1.31 | https://docs.aws.amazon.com/eks/ |
| **AKS** | 1.31 | https://learn.microsoft.com/azure/aks/ |
| **GKE** | 1.31 | https://cloud.google.com/kubernetes-engine/docs |

### Key Tools

| Tool | Version | Purpose |
|------|---------|---------|
| **ArgoCD** | v2.13.x | GitOps deployment |
| **Flux** | v2.4.x | GitOps toolkit |
| **Kustomize** | v5.5.x | Manifest customization |
| **Helm** | v3.16.x | Package management |
| **Velero** | 1.15.x | Backup/restore |
| **Trivy** | 0.58.x | Security scanning |
| **Kyverno** | 1.13.x | Policy engine |

## Command Conventions

**Important**: Use `kubectl` on standard Kubernetes. Use `oc` on OpenShift/ARO.
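One way to apply this convention in scripts is to choose the CLI once and reuse it everywhere. The sketch below is illustrative only: the function name `cli_for_platform`, the `PLATFORM` variable, and the platform labels it accepts are assumptions, not part of the bundled scripts.

```bash
# Hypothetical helper: map a platform label to the CLI named in the convention above.
cli_for_platform() {
  case "$1" in
    openshift|aro) echo "oc" ;;       # OpenShift and ARO use oc
    *)             echo "kubectl" ;;  # any other platform uses kubectl
  esac
}

# Choose the CLI once; PLATFORM is an assumed environment variable.
KUBE_CLI="$(cli_for_platform "${PLATFORM:-kubernetes}")"

# Later commands can then be written as, for example:
#   "$KUBE_CLI" get nodes -o wide
```

Passing the platform in explicitly avoids guessing from which binaries happen to be installed, since a workstation may have both `kubectl` and `oc` on the `PATH`.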

---

## 1. Cluster Operations

### Node Management

```bash
# View nodes
kubectl get nodes -o wide

# Drain node for maintenance
kubectl drain ${NODE} --ignore-daemonsets --delete-emptydir-data --grace-period=60

# Uncordon after maintenance
kubectl uncordon ${NODE}

# View node resources
kubectl top nodes
```

### Cluster Upgrades

**AKS:**
```bash
az aks get-upgrades -g ${RG} -n ${CLUSTER} -o table
az aks upgrade -g ${RG} -n ${CLUSTER} --kubernetes-version ${VERSION}
```

**EKS:**
```bash
aws eks update-cluster-version --name ${CLUSTER} --kubernetes-version ${VERSION}
```

**GKE:**
```bash
gcloud container clusters upgrade ${CLUSTER} --master --cluster-version ${VERSION}
```

**OpenShift:**
```bash
oc adm upgrade --to=${VERSION}
oc get clusterversion
```
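Before running any of the upgrade commands above, it can help to sanity-check `${VERSION}` against the current version. Kubernetes control planes are generally upgraded one minor version at a time, though each provider documents its own supported paths, so treat the check below as a rough guard rather than a rule. The function name `is_single_minor_step` is hypothetical, and it assumes versions of the form `MAJOR.MINOR[.PATCH]`.

```bash
# Hypothetical helper: succeed only if TARGET has the same major version as
# CURRENT and is the same minor or exactly one minor above it.
# Assumes versions look like MAJOR.MINOR[.PATCH], e.g. 1.30.4.
is_single_minor_step() {
  cur_major="${1%%.*}"; cur_rest="${1#*.}"; cur_minor="${cur_rest%%.*}"
  tgt_major="${2%%.*}"; tgt_rest="${2#*.}"; tgt_minor="${tgt_rest%%.*}"

  # Different major versions are never treated as a single step.
  [ "$cur_major" = "$tgt_major" ] || return 1

  step=$((tgt_minor - cur_minor))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}

# Example use before an upgrade (CURRENT_VERSION is an assumed variable):
#   is_single_minor_step "$CURRENT_VERSION" "$VERSION" || echo "check the upgrade path first"
```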

### Backups with Velero

```bash
# Install Velero
velero install --provider ${PROVIDER} --bucket ${BUCKET} --secret-file ${CREDS}

# Create backup
velero backup create ${BACKUP_NAME} --include-namespaces ${NS}

# Restore
velero restore create --from-backup ${BACKUP_NAME}
```

---

## 2. Troubleshooting

### Health Assessment

Run the bundled script for a comprehensive health check:
```bash
bash scripts/cluster-health-check.sh
```

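### Interpreting Pod Status

| Status | Meaning | Action |
|--------|---------|--------|
| `Pending` | Scheduling problem | Check resources, nodeSelector, tolerations |
| `CrashLoopBackOff` | Container is crashing | Check logs: `kubectl logs ${POD} --previous` |
| `ImagePullBackOff` | Image unavailable | Verify the image name and registry access |
| `OOMKilled` | Out of memory | Increase the memory limit |
| `Evicted` | Node pressure | Check node resources |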
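The table above can be turned into a small triage filter. This is a minimal sketch, not one of the bundled scripts: the function names `suggest_action` and `triage_pods` are hypothetical, and the parser assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```bash
# Hypothetical helper: map a pod status to the first check from the table above.
suggest_action() {
  case "$1" in
    Pending)          echo "check resources, nodeSelector, tolerations" ;;
    CrashLoopBackOff) echo "check logs with --previous" ;;
    ImagePullBackOff) echo "verify image name and registry access" ;;
    OOMKilled)        echo "increase memory limit" ;;
    Evicted)          echo "check node resources" ;;
    *)                echo "no suggestion" ;;
  esac
}

# Read pod lines (NAME READY STATUS ...) on stdin and print one suggestion
# for every pod that is not Running or Completed.
triage_pods() {
  while read -r name ready status rest; do
    case "$status" in
      Running|Completed|"") continue ;;
    esac
    echo "${name}: ${status}: $(suggest_action "$status")"
  done
}
```

A typical invocation would be `kubectl get pods -n ${NS} --no-headers | triage_pods`. Statuses not listed in the table fall through to "no suggestion", so `kubectl describe pod` is still the next step for anything unexpected.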

### Debugging Commands

```bash
# Pod logs (current and previous)
kubectl logs ${POD} -c ${CONTAINER}
kubectl logs ${POD} -c ${CONTAINER} --previous

# Multi-pod logs with stern
stern ${LABEL_SELECTOR} -n ${NS}

# Exec into pod
kubectl exec -it ${POD} -- /bin/sh

# Pod events
kubectl describe pod ${POD} | grep -A 20 Events

# Cluster events (sorted by time)
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
```

### Network Troubleshooting

```bash
# Test DNS
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default

# Test service connectivity
kubectl run -it --rm debug --image=curlimages/curl -- curl -v http://${SVC}.${NS}:${PORT}

# Check endpoints
kubectl get endpoints ${SVC}
```

---

## 3. Manifest Generation

### Production Deployment Template

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${APP_NAME}
  namespace: ${NAMESPACE}
  labels:
    app.kubernetes.io/name: ${APP_NAME}
    app.kubernetes.io/version: "${VERSION}"
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ${APP_NAME}
    spec:
      serviceAccountName: ${APP_NAME}
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: ${APP_NAME}
          image: ${IMAGE}:${TAG}
          ports:
            - name: http
              containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: ${APP_NAME}
                topologyKey: kubernetes.io/hostname
```

### Service and Ingress

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ${APP_NAME}
spec:
  selector:
    app.kubernetes.io/name: ${APP_NAME}
  ports:
    - name: http
      port: 80
      targetPort: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${APP_NAME}
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - ${HOST}
      secretName: ${APP_NAME}-tls
  rules:
    - host: ${HOST}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ${APP_NAME}
                port:
                  name: http
```

### OpenShift Route

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: ${APP_NAME}
spec:
  to:
    kind: Service
    name: ${APP_NAME}
  port:
    targetPort: http
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
```

Generate manifests with the bundled script:
```bash
bash scripts/generate-manifest.sh deployment myapp production
```

---

## 4. Security

### Security Audit

Run the bundled script:
```bash
bash scripts/security-audit.sh [namespace]
```

### Pod Security Standards

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/warn: restricted
```

### NetworkPolicy (Zero Trust)

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ${APP_NAME}-policy
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: database
      ports:
        - protocol: TCP
          port: 5432
    # Allow DNS
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
```

### RBAC Best Practices

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${APP_NAME}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${APP_NAME}-role
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${APP_NAME}-binding
subjects:
  - kind: ServiceAccount
    name: ${APP_NAME}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ${APP_NAME}-role
```

### Image Scanning

```bash
# Scan image with Trivy
trivy image ${IMAGE}:${TAG}

# Scan with severity filter
trivy image --severity HIGH,CRITICAL ${IMAGE}:${TAG}

# Generate SBOM
trivy image --format spdx-json -o sbom.json ${IMAGE}:${TAG}
```

---

## 5. GitOps

### ArgoCD Application

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ${APP_NAME}
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: ${GIT_REPO}
    targetRevision: main
    path: k8s/overlays/${ENV}
  destination:
    server: https://kubernetes.default.svc
    namespace: ${NAMESPACE}
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

### Kustomize Structure

```
k8s/
├── base/
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   └── service.yaml
└── overlays/
    ├── dev/
    │   └── kustomization.yaml
    ├── staging/
    │   └── kustomization.yaml
    └── prod/
        └── kustomization.yaml
```

**base/kustomization.yaml:**
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
```

**overlays/prod/kustomization.yaml:**
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namePrefix: prod-
namespace: production
replicas:
  - name: myapp
    count: 5
images:
  - name: myregistry/myapp
    newTag: v1.2.3
```
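The layout above could be scaffolded with a few lines of shell. This is a sketch under stated assumptions: the function name `scaffold_kustomize` is hypothetical, it overwrites any existing `kustomization.yaml` files under the target directory, and it writes only minimal overlays that reference the base. The `deployment.yaml` and `service.yaml` files listed in the base still need to be added by hand.

```bash
# Hypothetical helper: create the base/overlays layout under ROOT (default: k8s).
# Note: existing kustomization.yaml files under ROOT are overwritten.
scaffold_kustomize() {
  root="${1:-k8s}"

  mkdir -p "$root/base"
  cat > "$root/base/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
EOF

  # One minimal overlay per environment, each pointing back at the base.
  for overlay in dev staging prod; do
    mkdir -p "$root/overlays/$overlay"
    cat > "$root/overlays/$overlay/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
EOF
  done
}
```

Once the base manifests exist, an overlay can be rendered for review with `kubectl kustomize k8s/overlays/prod` before anything is applied.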

### GitHub Actions CI/CD

```yaml
name: Build and Deploy
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ secrets.REGISTRY }}/${{ github.event.repository.name }}:${{ github.sha }}
      - name: Update Kustomize image
        run: |
          cd k8s/overlays/prod
          kustomize edit set image myapp=${{ secrets.REGISTRY }}/${{ github.event.repository.name }}:${{ github.sha }}
      - name: Commit and push
        run: |
          git config user.name "github-actions"
          git config user.email "[email protected]"
          git add .
          git commit -m "Update image to ${{ github.sha }}"
          git push
```

Sync an ArgoCD application with the bundled script:
```bash
bash scripts/argocd-app-sync.sh ${APP_NAME} --prune
```

---

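## Helper Scripts

This skill includes automation scripts in the `scripts/` directory:

| Script | Purpose |
|--------|---------|
| `cluster-health-check.sh` | Comprehensive cluster health assessment with scoring |
| `security-audit.sh` | Security posture audit (privileged, root, RBAC, NetworkPolicy) |
| `node-maintenance.sh` | Safe node drain and maintenance preparation |
| `pre-upgrade-check.sh` | Pre-upgrade validation checklist |
| `generate-manifest.sh` | Generate production-ready K8s manifests |
| `argocd-app-sync.sh` | ArgoCD application sync helper |

Run any script:
```bash
bash scripts/<script-name>.sh [arguments]
```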
