Kubernetes Troubleshooting: Essential kubectl Commands That Actually Work
Stop guessing. There's a systematic way to debug Kubernetes that works nearly every time. In this guide, I'll walk you through the exact kubectl commands and troubleshooting approach that experienced engineers use daily: no guesswork, just results.
Why Kubectl Troubleshooting Matters
Kubernetes abstracts away infrastructure complexity, which is great until something breaks. When it does, you can't just ssh into a box and poke around. You need to understand how to query the cluster state, inspect logs, and trace problems from the cluster level down to individual containers.
The good news: kubectl gives you everything you need. You just need to know what to ask.
The Top-Down Troubleshooting Approach
Always start at the cluster level and work your way down. Most issues reveal themselves in this progression:
- Cluster health: Is the cluster itself working?
- Namespace health: Are resources in the right namespace?
- Pod status: Are pods running or stuck?
- Container logs: What's the application actually doing?
This ordering saves hours. Don't skip straight to logs; you'll miss obvious issues.
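As a quick sketch, the whole progression fits in four commands. The namespace and pod names here are placeholders; substitute your own:

```shell
# 1. Cluster: are all nodes Ready?
kubectl get nodes

# 2. Namespace: is the workload where you expect it?
kubectl get pods -n my-namespace

# 3. Pod: what do the events say?
kubectl describe pod my-pod -n my-namespace

# 4. Container: what is the app itself reporting?
kubectl logs my-pod -n my-namespace
```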
Step 1: Check Cluster Health
Start here every single time:
kubectl get nodes
This shows all nodes in your cluster. Look for the STATUS column. You want Ready, not NotReady or SchedulingDisabled.
If a node is NotReady, dig deeper:
kubectl describe node <node-name>
This shows conditions, capacity, and recent events. Look for things like DiskPressure, MemoryPressure, or PIDPressure. These are often the culprit.
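On a struggling node, the Conditions block of the describe output looks roughly like this (the values shown are illustrative, not from a real cluster):

```
Conditions:
  Type            Status  Reason                      Message
  ----            ------  ------                      -------
  MemoryPressure  False   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    True    KubeletHasDiskPressure      kubelet has disk pressure
  PIDPressure     False   KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready           False   KubeletNotReady             ...
```

Any condition other than Ready showing True is a red flag: it means the node is under that kind of pressure.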
Check component health:
kubectl get cs
This reports the status of core control-plane components: the scheduler, controller-manager, and etcd. If any show Unhealthy, your cluster has fundamental problems. Be aware that componentstatuses has been deprecated since Kubernetes 1.19, so on newer clusters this command may return limited or no data.
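On newer clusters, you can query the API server's health endpoints directly instead. The /readyz and /livez endpoints are standard, though the verbosity of the output varies by version:

```shell
# Overall API server readiness, with a per-check breakdown
kubectl get --raw='/readyz?verbose'

# Liveness only
kubectl get --raw='/livez'
```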
Step 2: Check Namespace and Pod Status
Once the cluster looks healthy, narrow down to your workload:
kubectl get pods -n <namespace>
Look at the STATUS column. Healthy pods show Running. Problem pods show things like:
- Pending: Pod can't be scheduled (usually resource constraints)
- CrashLoopBackOff: Application is crashing repeatedly
- ImagePullBackOff: Can't pull the container image
- OOMKilled: Out of memory
- Evicted: Node ran out of resources and killed the pod
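Two of those states, Pending and OOMKilled, usually trace back to resource requests and limits. A minimal container spec that sets both looks like this (names and values are examples, not recommendations):

```yaml
# Fragment of a pod/deployment spec -- names and values are illustrative
containers:
  - name: myapp
    image: myapp:1.0
    resources:
      requests:        # what the scheduler reserves; set too high -> Pending
        cpu: 100m
        memory: 128Mi
      limits:          # hard caps; exceeding the memory limit -> OOMKilled
        cpu: 500m
        memory: 256Mi
```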
For more detail on a specific pod:
kubectl describe pod <pod-name> -n <namespace>
Read the Events section at the bottom. This is your treasure map. It shows exactly what happened in chronological order.
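For example, an image-pull problem shows up in Events along these lines (illustrative output):

```
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  2m    default-scheduler  Successfully assigned production/myapp-abc123 to node-1
  Normal   Pulling    2m    kubelet            Pulling image "myapp:1.0"
  Warning  Failed     2m    kubelet            Failed to pull image "myapp:1.0": not found
  Warning  BackOff    90s   kubelet            Back-off pulling image "myapp:1.0"
```

The Warning lines name the failing step directly, often before the application has logged anything at all.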
Step 3: Get Inside the Logs
Now you can safely look at logs:
kubectl logs <pod-name> -n <namespace>
If the pod has multiple containers, specify which one:
kubectl logs <pod-name> -c <container-name> -n <namespace>
For a crashing pod, see the previous run's logs:
kubectl logs <pod-name> --previous -n <namespace>
Stream logs in real-time:
kubectl logs -f <pod-name> -n <namespace>
The -f flag works just like tail -f on Linux.
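A few extra flags make logs far easier to work with; all of these are standard kubectl logs options:

```shell
# Last 100 lines only
kubectl logs my-pod -n my-namespace --tail=100

# Only entries from the last hour
kubectl logs my-pod -n my-namespace --since=1h

# Prefix each line with its timestamp
kubectl logs my-pod -n my-namespace --timestamps
```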
Step 4: Check Services and Networking
If pods are running but your app is unreachable:
kubectl get svc -n <namespace>
Look at the CLUSTER-IP and EXTERNAL-IP. Make sure they exist and look reasonable.
Check endpoints:
kubectl get endpoints -n <namespace>
A service with no endpoints means the selector isn't matching any pods. This is surprisingly common.
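The empty-endpoints case is almost always a label mismatch. In a working pair, the Service's selector must match the pod template's labels exactly; here is a minimal sketch with illustrative names:

```yaml
# Service -- selects pods labeled app: myapp
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp        # must match the pod labels below
  ports:
    - port: 80
      targetPort: 8080
---
# Deployment -- its pod template carries the matching label
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp    # if this differs, the Service has no endpoints
    spec:
      containers:
        - name: myapp
          image: myapp:1.0
```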
Test connectivity from inside the cluster:
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
Now try to reach the service from inside the pod. This tells you if the issue is external routing or internal networking.
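From inside the pod, plain HTTP tools are enough to test a service; whether wget or curl is available depends on the container image (service name and port here are placeholders):

```shell
# Inside the pod's shell -- the service name resolves via cluster DNS
wget -qO- http://myapp.my-namespace.svc.cluster.local:80

# or, if the image ships curl instead of wget:
curl -s http://myapp:80
```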
Real-World Example: Debugging a CrashLoopBackOff
Let's say you deploy an app and immediately see CrashLoopBackOff. Here's exactly what you do:
1. Get the pod status:
kubectl get pods -n production
You see your app is CrashLoopBackOff.
2. Describe the pod:
kubectl describe pod myapp-abc123 -n production
The Events section shows "Back-off restarting failed container".
3. Check the logs from the previous run:
kubectl logs myapp-abc123 --previous -n production
You see: "Error: Database connection refused".
4. Check if the database service exists:
kubectl get svc -n production | grep database
Nothing. The database service isn't running.
5. Deploy the database:
kubectl apply -f database.yaml -n production
6. Verify the app recovers:
kubectl get pods -n production
Your app should move to Running now.
That's the entire workflow. Top-down, methodical, and it works.
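For completeness, the database.yaml from step 5 could be as small as a Deployment plus a Service. Everything here (names, image, port) is illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database
spec:
  replicas: 1
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
---
apiVersion: v1
kind: Service
metadata:
  name: database
spec:
  selector:
    app: database
  ports:
    - port: 5432
```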
Essential Commands Reference
Cluster overview:
- kubectl cluster-info: General cluster information
- kubectl get nodes: List all nodes and their status
- kubectl describe node <name>: Detailed node info
Pod debugging:
- kubectl get pods -A: All pods in all namespaces
- kubectl describe pod <name>: Pod events and status
- kubectl logs <pod>: Container logs
- kubectl logs <pod> --previous: Previous run logs
- kubectl logs -f <pod>: Stream logs
- kubectl exec -it <pod> -- /bin/sh: Shell into a pod
Services and networking:
- kubectl get svc: List services
- kubectl get endpoints: Service endpoints
- kubectl port-forward <pod> 8080:8080: Forward local port to pod
Resource inspection:
- kubectl get all -n <namespace>: Everything in a namespace
- kubectl describe <resource-type> <name>: Details on any resource
- kubectl get events -n <namespace>: Recent cluster events
Common Mistakes to Avoid
Not specifying the namespace. kubectl defaults to the "default" namespace. Your pod might be in "production". Always pass -n, or set the namespace on your current context with kubectl config set-context --current --namespace=<namespace>.
Skipping the describe step. People jump straight to logs. The Events section in describe output often tells you the real problem before you even look at application logs.
Ignoring node-level issues. If multiple pods are failing in strange ways, check your nodes first. Disk full, out of memory, or kernel panics will affect everything.
Not checking the selector. Pods not being created? Service with no endpoints? Check if your labels match your selectors.
Assuming logs are the root cause. An error in logs is a symptom, not always the cause. A pod might crash because a dependency is missing, not because of code in the pod itself.
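Checking labels against selectors takes two commands; both flags are standard kubectl options (the namespace and label here are placeholders):

```shell
# See the labels each pod actually carries
kubectl get pods -n my-namespace --show-labels

# List only the pods matching the selector your Service uses
kubectl get pods -n my-namespace -l app=myapp
```

If the first command shows pods but the second returns nothing, you've found your mismatch.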
When to Escalate
If your cluster itself is unhealthy (nodes NotReady, api-server Unhealthy), you're looking at infrastructure or etcd problems. This needs a different level of troubleshooting.
If individual pods are hitting resource limits repeatedly (OOMKilled, CPU throttling), it's a capacity planning issue, not a bug.
If networking looks broken (endpoint discovery failing, services not accessible), check your CNI plugin and network policies.
The Bottom Line
Kubernetes troubleshooting isn't magic. It's a repeatable process: start at the cluster, work down to the namespace, then the pod, then the container. Use kubectl describe for context, logs for details, and always check the Events section first.
Next time something breaks at 3 AM, you'll know exactly where to look.
