Kubernetes/OpenShift Troubleshooting Guide
1. Systematic Troubleshooting Approach
The 5-Step Method
1. IDENTIFY → What is the problem?
2. GATHER → Collect relevant information
3. ANALYZE → Examine logs, events, metrics
4. ISOLATE → Narrow down the root cause
5. RESOLVE → Fix and verify the solution
Troubleshooting Checklist
- [ ] Check pod status and events
- [ ] Review container logs
- [ ] Verify resource availability
- [ ] Check network connectivity
- [ ] Validate configurations
- [ ] Review RBAC permissions
- [ ] Check node health
- [ ] Verify storage availability
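The checklist above can be sketched as a small shell helper that prints the relevant commands for a given pod. This is a minimal sketch (the function name and behavior are my own, not a standard tool); it only echoes the commands so you can review them before running anything:

```shell
# triage: print the checklist commands for one pod.
# Sketch only -- echoes commands rather than running them.
triage() {
  ns=$1
  pod=$2
  echo "kubectl describe pod $pod -n $ns"                  # status and events
  echo "kubectl logs $pod -n $ns --previous"               # logs from last crash
  echo "kubectl top pod $pod -n $ns"                       # resource usage
  echo "kubectl get events -n $ns --sort-by=.metadata.creationTimestamp"
  echo "kubectl get networkpolicy -n $ns"                  # connectivity blockers
  echo "kubectl auth can-i list pods -n $ns"               # RBAC sanity check
}

# Example:
triage default myapp-5d4b8c7f9-abc12
```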
2. Pod Issues
Pod Stuck in Pending State
Symptoms:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
myapp-5d4b8c7f9-abc12 0/1 Pending 0 5m
Diagnosis:
# Check pod events
kubectl describe pod myapp-5d4b8c7f9-abc12
# Common causes in events:
# - "Insufficient cpu/memory"
# - "No nodes available"
# - "PersistentVolumeClaim not found"
# - "node(s) had taints that the pod didn't tolerate"
Solutions:
1. Insufficient Resources:
# Check node resources
kubectl top nodes
kubectl describe nodes
# Check resource requests
kubectl get pod myapp-5d4b8c7f9-abc12 -o yaml | grep -A 5 resources
# Solutions:
# - Add more nodes
# - Reduce resource requests
# - Delete unused pods
# - Scale down other deployments
2. Node Selector/Affinity Issues:
# Check node labels
kubectl get nodes --show-labels
# Check pod node selector
kubectl get pod myapp-5d4b8c7f9-abc12 -o yaml | grep -A 5 nodeSelector
# Solution: Add required labels to nodes
kubectl label nodes node1 disktype=ssd
3. PVC Not Bound:
# Check PVC status
kubectl get pvc
# Check PV availability
kubectl get pv
# Solution: Create PV or fix storage class
CrashLoopBackOff
Symptoms:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
myapp-5d4b8c7f9-abc12 0/1 CrashLoopBackOff 5 10m
Diagnosis:
# Check logs
kubectl logs myapp-5d4b8c7f9-abc12
kubectl logs myapp-5d4b8c7f9-abc12 --previous
# Check events
kubectl describe pod myapp-5d4b8c7f9-abc12
# Common causes:
# - Application error
# - Missing dependencies
# - Configuration error
# - Failed health checks
# - Insufficient permissions
Solutions:
1. Application Error:
# Debug with shell access
kubectl run debug --rm -it --image=busybox -- sh
# Or debug the actual pod
kubectl debug myapp-5d4b8c7f9-abc12 -it --image=busybox
# Check application configuration
kubectl get configmap myapp-config -o yaml
kubectl get secret myapp-secret -o yaml
2. Failed Liveness Probe:
# Adjust probe settings
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60   # Increase if app takes time to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
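On clusters where startup probes are available (Kubernetes 1.18+), a startupProbe is often a cleaner fix for slow-starting applications than a large initialDelaySeconds, because liveness checks only begin once startup succeeds. A sketch, reusing the path and port assumed above:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30   # Allows up to 30 * 10s = 300s for startup
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```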
3. Permission Issues:
# Check security context
kubectl get pod myapp-5d4b8c7f9-abc12 -o yaml | grep -A 10 securityContext
# For OpenShift, check SCC
oc describe pod myapp-5d4b8c7f9-abc12 | grep scc
ImagePullBackOff
Symptoms:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
myapp-5d4b8c7f9-abc12 0/1 ImagePullBackOff 0 2m
Diagnosis:
# Check events
kubectl describe pod myapp-5d4b8c7f9-abc12
# Common errors:
# - "Failed to pull image"
# - "manifest unknown"
# - "unauthorized"
# - "connection refused"
Solutions:
1. Image Doesn't Exist:
# Verify image exists
docker pull myapp:v1.0
# Check image name in deployment
kubectl get deployment myapp -o yaml | grep image:
# Fix: Update to correct image
kubectl set image deployment/myapp myapp=myapp:v1.0
2. Authentication Required:
# Create docker registry secret
kubectl create secret docker-registry regcred \
--docker-server=docker.io \
--docker-username=myuser \
--docker-password=mypass \
--docker-email=myemail@example.com
# Add to deployment
kubectl patch deployment myapp -p '
{
"spec": {
"template": {
"spec": {
"imagePullSecrets": [{"name": "regcred"}]
}
}
}
}'
3. Network Issues:
# Test connectivity from node
ssh node1
curl -I https://registry.example.com
# Check proxy settings for the container runtime on the node
# (registry pulls use the runtime's proxy config, not kube-proxy)
systemctl show containerd --property=Environment | grep -i proxy
OOMKilled (Out of Memory)
Symptoms:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
myapp-5d4b8c7f9-abc12 0/1 OOMKilled 3 5m
Diagnosis:
# Check memory usage
kubectl top pod myapp-5d4b8c7f9-abc12
# Check memory limits
kubectl get pod myapp-5d4b8c7f9-abc12 -o yaml | grep -A 5 resources
# Check events
kubectl describe pod myapp-5d4b8c7f9-abc12
Solutions:
# Increase memory limits
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"   # Increased from 512Mi
# Or investigate memory leak in application
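To find every container in the cluster whose last termination was an OOM kill, the pod JSON can be filtered with jq (already used elsewhere in this guide). A sketch, assuming jq is installed; the function reads `kubectl get pods -A -o json` on stdin:

```shell
# oom_victims: read `kubectl get pods -A -o json` on stdin and print
# namespace/pod/container for containers last terminated with OOMKilled.
oom_victims() {
  jq -r '.items[]
    | . as $p
    | .status.containerStatuses[]?
    | select(.lastState.terminated.reason? == "OOMKilled")
    | "\($p.metadata.namespace)/\($p.metadata.name)/\(.name)"'
}

# Usage:
# kubectl get pods -A -o json | oom_victims
```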
3. Service and Networking Issues
Service Not Accessible
Diagnosis:
# Check service
kubectl get svc myapp-service
kubectl describe svc myapp-service
# Check endpoints
kubectl get endpoints myapp-service
# If no endpoints, pods don't match selector
kubectl get pods --show-labels
kubectl get svc myapp-service -o yaml | grep selector
Solutions:
1. Label Mismatch:
# Fix pod labels
kubectl label pods myapp-5d4b8c7f9-abc12 app=myapp
# Or fix service selector
kubectl patch svc myapp-service -p '{"spec":{"selector":{"app":"myapp"}}}'
2. Port Mismatch:
# Check service ports
kubectl get svc myapp-service -o yaml
# Check container ports
kubectl get pod myapp-5d4b8c7f9-abc12 -o yaml | grep containerPort
# Fix service port
kubectl patch svc myapp-service -p '
{
"spec": {
"ports": [{
"port": 80,
"targetPort": 8080
}]
}
}'
3. Network Policy Blocking:
# Check network policies
kubectl get networkpolicy
# Test connectivity
kubectl run test --rm -it --image=busybox -- wget -O- http://myapp-service
DNS Resolution Issues
Diagnosis:
# Test DNS from pod
kubectl run test --rm -it --image=busybox -- nslookup myapp-service
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
Solutions:
# Restart CoreDNS
kubectl rollout restart deployment/coredns -n kube-system
# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml
# Verify service exists
kubectl get svc myapp-service
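A quick sanity check is to compare the pod's /etc/resolv.conf against the cluster DNS service IP. A minimal sketch (the function is my own helper, and 10.96.0.10 is only a common default; confirm the real IP with `kubectl get svc -n kube-system kube-dns`):

```shell
# check_resolv: read resolv.conf text on stdin and report whether the
# expected cluster DNS IP (arg 1) is configured as a nameserver.
check_resolv() {
  if grep -q "^nameserver $1$" -; then
    echo "OK: cluster DNS $1 is configured"
  else
    echo "MISMATCH: expected nameserver $1"
  fi
}

# Usage:
# kubectl exec myapp-5d4b8c7f9-abc12 -- cat /etc/resolv.conf | check_resolv 10.96.0.10
```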
Ingress Not Working
Diagnosis:
# Check ingress
kubectl get ingress
kubectl describe ingress myapp-ingress
# Check ingress controller
kubectl get pods -n ingress-nginx
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# Test backend service
kubectl run test --rm -it --image=curlimages/curl -- \
curl http://myapp-service.default.svc.cluster.local
Solutions:
1. Ingress Class Missing:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
spec:
  ingressClassName: nginx   # Add this
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-service
            port:
              number: 80
2. DNS Not Pointing to Ingress:
# Get ingress IP
kubectl get ingress myapp-ingress
# Verify DNS
nslookup myapp.example.com
# Update DNS A record to point to ingress IP
4. Storage Issues
PVC Stuck in Pending
Diagnosis:
# Check PVC
kubectl get pvc
kubectl describe pvc myapp-pvc
# Check storage class
kubectl get storageclass
# Check PV availability
kubectl get pv
Solutions:
1. No Storage Class:
# Set default storage class
kubectl patch storageclass standard -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
# Or specify in PVC
kubectl patch pvc myapp-pvc -p '{"spec":{"storageClassName":"standard"}}'
2. No Available PV:
# Create PV manually
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-manual
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt/data
EOF
Volume Mount Issues
Diagnosis:
# Check pod events
kubectl describe pod myapp-5d4b8c7f9-abc12
# Common errors:
# - "Unable to mount volumes"
# - "Volume is already attached"
# - "Permission denied"
Solutions:
1. Volume Already Attached:
# Find which pod is using the volume
kubectl get pods -o json | jq -r '.items[] |
select(.spec.volumes[]?.persistentVolumeClaim.claimName=="myapp-pvc") |
.metadata.name'
# Delete the old pod
kubectl delete pod <old-pod-name>
2. Permission Issues:
# Set fsGroup in pod security context
securityContext:
  fsGroup: 2000
  runAsUser: 1000
5. Node Issues
Node NotReady
Diagnosis:
# Check node status
kubectl get nodes
kubectl describe node node1
# Check node conditions
kubectl get node node1 -o yaml | grep -A 10 conditions
# SSH to node and check
ssh node1
systemctl status kubelet
journalctl -u kubelet -n 100
Solutions:
1. Kubelet Not Running:
# On the node
systemctl start kubelet
systemctl enable kubelet
# Check kubelet logs
journalctl -u kubelet -f
2. Disk Pressure:
# Check disk usage
df -h
# Clean up unused images (command depends on the container runtime)
docker system prune -a   # Docker runtime
crictl rmi --prune       # containerd/CRI-O runtime
kubectl delete pods --field-selector=status.phase=Failed -A
# Increase disk or add new node
3. Network Issues:
# Check CNI plugin
kubectl get pods -n kube-system | grep -E 'calico|flannel|weave'
# Restart CNI pods
kubectl delete pods -n kube-system -l k8s-app=calico-node
High Resource Usage
Diagnosis:
# Check node resources
kubectl top nodes
kubectl describe node node1
# Find resource-hungry pods
kubectl top pods --all-namespaces --sort-by=memory
kubectl top pods --all-namespaces --sort-by=cpu
Solutions:
# Set resource limits
kubectl set resources deployment myapp --limits=cpu=500m,memory=512Mi
# Implement HPA
kubectl autoscale deployment myapp --min=2 --max=10 --cpu-percent=70
# Add more nodes or scale down workloads
6. Deployment Issues
Rollout Stuck
Diagnosis:
# Check rollout status
kubectl rollout status deployment/myapp
# Check deployment
kubectl describe deployment myapp
# Check replica sets
kubectl get rs
kubectl describe rs myapp-5d4b8c7f9
Solutions:
1. Insufficient Resources:
# Check events
kubectl get events --sort-by=.metadata.creationTimestamp
# Scale down or add resources
kubectl scale deployment myapp --replicas=2
2. Image Pull Issues:
# Check new pods
kubectl get pods -l app=myapp
# Fix image or add pull secret
kubectl set image deployment/myapp myapp=myapp:v1.0
3. Failed Readiness Probe:
# Check pod logs
kubectl logs -l app=myapp
# Adjust probe or fix application
kubectl patch deployment myapp -p '
{
"spec": {
"template": {
"spec": {
"containers": [{
"name": "myapp",
"readinessProbe": {
"initialDelaySeconds": 60
}
}]
}
}
}
}'
Rollback Required
Commands:
# View rollout history
kubectl rollout history deployment/myapp
# Rollback to previous version
kubectl rollout undo deployment/myapp
# Rollback to specific revision
kubectl rollout undo deployment/myapp --to-revision=2
# Pause rollout
kubectl rollout pause deployment/myapp
# Resume rollout
kubectl rollout resume deployment/myapp
7. RBAC and Permission Issues
Forbidden Errors
Diagnosis:
# Check current user (kubectl 1.27+)
kubectl auth whoami
# Check permissions
kubectl auth can-i create pods
kubectl auth can-i create pods --as=system:serviceaccount:default:myapp-sa
# Check role bindings
kubectl get rolebindings
kubectl describe rolebinding myapp-binding
Solutions:
1. Create Role:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
2. Create RoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: ServiceAccount
  name: myapp-sa
  namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
3. For OpenShift SCC Issues:
# Check SCC
oc describe pod myapp-5d4b8c7f9-abc12 | grep scc
# Add SCC to service account
oc adm policy add-scc-to-user anyuid -z myapp-sa
8. Performance Issues
Slow Application Response
Diagnosis:
# Check resource usage
kubectl top pods
kubectl top nodes
# Check HPA status
kubectl get hpa
# Check pod distribution
kubectl get pods -o wide
# Check network latency
kubectl run test --rm -it --image=nicolaka/netshoot -- \
ping myapp-service.default.svc.cluster.local
Solutions:
1. Scale Up:
# Manual scaling
kubectl scale deployment myapp --replicas=5
# Enable HPA
kubectl autoscale deployment myapp --min=3 --max=10 --cpu-percent=70
2. Optimize Resources:
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1000m"
    memory: "1Gi"
3. Add Caching:
# Add Redis sidecar
containers:
- name: redis
  image: redis:6
  ports:
  - containerPort: 6379
9. Logging and Monitoring Issues
Missing Logs
Diagnosis:
# Check if pod is running
kubectl get pods
# Try different log commands
kubectl logs myapp-5d4b8c7f9-abc12
kubectl logs myapp-5d4b8c7f9-abc12 --previous
kubectl logs myapp-5d4b8c7f9-abc12 -c container-name
# Check logging infrastructure
kubectl get pods -n kube-system | grep -E 'fluentd|filebeat|logstash'
Solutions:
# Ensure application logs to stdout/stderr
# Check log rotation settings
# Verify logging agent is running
kubectl get daemonset -n kube-system
Metrics Not Available
Diagnosis:
# Check metrics server
kubectl get deployment metrics-server -n kube-system
kubectl top nodes # Should work if metrics server is running
# Check metrics server logs
kubectl logs -n kube-system -l k8s-app=metrics-server
Solutions:
# Install metrics server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# For development clusters, may need to disable TLS
kubectl patch deployment metrics-server -n kube-system --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'
10. Common Troubleshooting Commands
Quick Diagnostics
# Get all resources
kubectl get all -A
# Check events
kubectl get events --sort-by=.metadata.creationTimestamp -A
# Check resource usage
kubectl top nodes
kubectl top pods -A
# Check cluster info
kubectl cluster-info
kubectl version
# Check component status (deprecated since Kubernetes 1.19)
kubectl get componentstatuses
# Check API server
kubectl get --raw /healthz
kubectl get --raw /readyz
Deep Dive Commands
# Get pod YAML
kubectl get pod myapp-5d4b8c7f9-abc12 -o yaml
# Get pod JSON with jq
kubectl get pod myapp-5d4b8c7f9-abc12 -o json | jq '.status'
# Watch resources
kubectl get pods -w
kubectl get events -w
# Port forward for debugging
kubectl port-forward pod/myapp-5d4b8c7f9-abc12 8080:8080
# Execute commands in pod
kubectl exec -it myapp-5d4b8c7f9-abc12 -- /bin/bash
kubectl exec myapp-5d4b8c7f9-abc12 -- env
kubectl exec myapp-5d4b8c7f9-abc12 -- cat /etc/resolv.conf
# Copy files
kubectl cp myapp-5d4b8c7f9-abc12:/app/logs/app.log ./app.log
# Debug with ephemeral container (K8s 1.23+)
kubectl debug myapp-5d4b8c7f9-abc12 -it --image=busybox
11. Interview Preparation - Troubleshooting
STAR Method Examples
Situation: Production pods were crashing with OOMKilled errors.
Task: Identify root cause and implement solution without downtime.
Action:
1. Checked pod events and logs
2. Analyzed memory usage patterns
3. Identified memory leak in application
4. Increased memory limits temporarily
5. Worked with dev team to fix leak
6. Implemented proper resource limits and monitoring
Result: Zero downtime, 40% reduction in memory usage after fix.
Common Interview Questions
Q: Walk me through debugging a pod that won't start.
A:
1. Check pod status: kubectl get pods
2. Describe pod: kubectl describe pod <name>
3. Check events for errors
4. Review logs: kubectl logs <name>
5. Check previous logs if restarting
6. Verify image exists and is pullable
7. Check resource availability
8. Verify configurations (ConfigMaps, Secrets)
9. Check RBAC/SCC permissions
10. Test with debug container if needed
Q: How do you troubleshoot network connectivity issues?
A:
1. Verify service endpoints exist
2. Check pod labels match service selector
3. Test DNS resolution from pod
4. Check network policies
5. Verify ingress configuration
6. Test connectivity with debug pod
7. Check CNI plugin status
8. Review firewall rules
9. Verify load balancer configuration
Q: Describe a complex production issue you resolved.
A: Be ready with a real example covering:
- Initial symptoms
- Investigation process
- Tools used
- Root cause analysis
- Solution implemented
- Prevention measures
- Lessons learned
Troubleshooting Toolkit
Essential Tools:
# Debug containers
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash
kubectl run debug --rm -it --image=busybox -- sh
# Useful images:
# - nicolaka/netshoot (network debugging)
# - busybox (basic utilities)
# - curlimages/curl (HTTP testing)
# - alpine (lightweight with package manager)
Useful Commands:
# Inside debug pod
nslookup service-name
ping service-name
curl http://service-name
traceroute service-name
netstat -tulpn
ss -tulpn
tcpdump -i any port 80
Quick Reference Card
Pod Status Meanings
- Pending: Waiting to be scheduled
- ContainerCreating: Pulling image, creating container
- Running: All containers running
- Succeeded: All containers terminated successfully
- Failed: At least one container failed
- CrashLoopBackOff: Container keeps crashing
- ImagePullBackOff: Cannot pull image
- OOMKilled: Out of memory
- Error: Generic error state
- Unknown: Cannot determine state
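Each of these statuses maps naturally to a first diagnostic command. A small sketch of that mapping as a shell helper (my own convenience function, not a standard tool):

```shell
# next_step: suggest the first command to run for a given pod status.
next_step() {
  case $1 in
    Pending)          echo "kubectl describe pod <name>   # look for scheduling events" ;;
    CrashLoopBackOff) echo "kubectl logs <name> --previous" ;;
    ImagePullBackOff) echo "kubectl describe pod <name>   # check image name and pull secrets" ;;
    OOMKilled)        echo "kubectl top pod <name>        # compare usage to limits" ;;
    *)                echo "kubectl describe pod <name>" ;;
  esac
}

# Example:
next_step CrashLoopBackOff
```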
Common Exit Codes
- 0: Success
- 1: General error
- 2: Misuse of shell command
- 126: Command cannot execute
- 127: Command not found
- 130: Terminated by Ctrl+C
- 137: Killed (SIGKILL) - often OOM
- 143: Terminated (SIGTERM)
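The exit-code table can be turned into a small lookup helper, and the code itself is retrievable from the pod's last container state. A sketch; the jsonpath query assumes a single-container pod, and the helper function is my own:

```shell
# explain_exit: translate a container exit code into a likely cause.
explain_exit() {
  case $1 in
    0)   echo "Success" ;;
    1)   echo "General application error" ;;
    126) echo "Command cannot execute (permissions?)" ;;
    127) echo "Command not found (wrong entrypoint?)" ;;
    137) echo "SIGKILL - often OOMKilled" ;;
    143) echo "SIGTERM - graceful termination" ;;
    *)   echo "Unknown exit code: $1" ;;
  esac
}

# Fetch the last exit code of a single-container pod, then explain it:
# code=$(kubectl get pod myapp-5d4b8c7f9-abc12 \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
# explain_exit "$code"
```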
Troubleshooting Flow
```
Problem Reported
      ↓
Check Pod Status
      ↓
Describe Pod (Events)
      ↓
Check Logs
      ↓
Verify Configuration
      ↓
Check Resources
      ↓
Test Connectivity
      ↓
Implement Fix
      ↓
Verify Solution
      ↓
Document & Monitor
```