Kubernetes Troubleshooting: Essential kubectl Commands That Actually Work
Stop guessing. There's a systematic way to debug Kubernetes that works nearly every time. In this guide, I'll walk you through the exact kubectl commands and troubleshooting approach that experienced engineers use daily: no guesswork, just results.
Why Kubectl Troubleshooting Matters
Kubernetes abstracts away infrastructure complexity, which is great until something breaks. When it does, you can't just ssh into a box and poke around. You need to understand how to query the cluster state, inspect logs, and trace problems from the cluster level down to individual containers.
The good news: kubectl gives you everything you need. You just need to know what to ask.
The Top-Down Troubleshooting Approach
Always start at the cluster level and work your way down. Most issues reveal themselves in this progression:
- Cluster health: Is the cluster itself working?
- Namespace health: Are resources in the right namespace?
- Pod status: Are pods running or stuck?
- Container logs: What's the application actually doing?
This ordering saves hours. Don't skip straight to logs; you'll miss obvious issues.
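As a quick sketch, the whole progression fits in four commands. The namespace and pod names here are placeholders; substitute your own:

```shell
# 1. Cluster: are all nodes Ready?
kubectl get nodes

# 2. Namespace: is the workload where you expect it?
kubectl get pods -n my-namespace

# 3. Pod: what do the events say?
kubectl describe pod my-pod -n my-namespace

# 4. Container: what is the app itself reporting?
kubectl logs my-pod -n my-namespace
```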
Step 1: Check Cluster Health
Start here every single time:
kubectl get nodes
This shows all nodes in your cluster. Look for the STATUS column. You want Ready, not NotReady or SchedulingDisabled.
If a node is NotReady, dig deeper:
kubectl describe node <node-name>
This shows conditions, capacity, and recent events. Look for things like DiskPressure, MemoryPressure, or PIDPressure. These are often the culprit.
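On a struggling node, the Conditions block of the describe output looks roughly like this (the values shown are illustrative, not from a real cluster):

```
Conditions:
  Type            Status  Reason                      Message
  ----            ------  ------                      -------
  MemoryPressure  False   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    True    KubeletHasDiskPressure      kubelet has disk pressure
  PIDPressure     False   KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready           False   KubeletNotReady             ...
```

Any condition other than Ready showing True is a red flag: it means the node is under that kind of pressure.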
Check component health:
kubectl get cs
This reports the status of core control-plane components: the scheduler, controller-manager, and etcd. If any show Unhealthy, your cluster has fundamental problems. Be aware that componentstatuses has been deprecated since Kubernetes 1.19, so on newer clusters this command may return limited or no data.
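On newer clusters, you can query the API server's health endpoints directly instead. The /readyz and /livez endpoints are standard, though the verbosity of the output varies by version:

```shell
# Overall API server readiness, with a per-check breakdown
kubectl get --raw='/readyz?verbose'

# Liveness only
kubectl get --raw='/livez'
```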
Step 2: Check Namespace and Pod Status
Once the cluster looks healthy, narrow down to your workload:
kubectl get pods -n <namespace>
Look at the STATUS column. Healthy pods show Running. Problem pods show things like:
- Pending: Pod can't be scheduled (usually resource constraints)
- CrashLoopBackOff: Application is crashing repeatedly
- ImagePullBackOff: Can't pull the container image
- OOMKilled: Out of memory
- Evicted: Node ran out of resources and killed the pod
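Two of those states, Pending and OOMKilled, usually trace back to resource requests and limits. A minimal container spec that sets both looks like this (names and values are examples, not recommendations):

```yaml
# Fragment of a pod/deployment spec -- names and values are illustrative
containers:
  - name: myapp
    image: myapp:1.0
    resources:
      requests:        # what the scheduler reserves; set too high -> Pending
        cpu: 100m
        memory: 128Mi
      limits:          # hard caps; exceeding the memory limit -> OOMKilled
        cpu: 500m
        memory: 256Mi
```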
For more detail on a specific pod:
kubectl describe pod <pod-name> -n <namespace>
Read the Events section at the bottom. This is your treasure map. It shows exactly what happened in chronological order.
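For example, an image-pull problem shows up in Events along these lines (illustrative output):

```
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  2m    default-scheduler  Successfully assigned production/myapp-abc123 to node-1
  Normal   Pulling    2m    kubelet            Pulling image "myapp:1.0"
  Warning  Failed     2m    kubelet            Failed to pull image "myapp:1.0": not found
  Warning  BackOff    90s   kubelet            Back-off pulling image "myapp:1.0"
```

The Warning lines name the failing step directly, often before the application has logged anything at all.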
Step 3: Get Inside the Logs
Now you can safely look at logs:
kubectl logs <pod-name> -n <namespace>
If the pod has multiple containers, specify which one:
kubectl logs <pod-name> -c <container-name> -n <namespace>
For a crashing pod, see the previous run's logs:
kubectl logs <pod-name> --previous -n <namespace>
Stream logs in real-time:
kubectl logs -f <pod-name> -n <namespace>
The -f flag works just like tail -f on Linux.
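A few extra flags make logs far easier to work with; all of these are standard kubectl logs options:

```shell
# Last 100 lines only
kubectl logs my-pod -n my-namespace --tail=100

# Only entries from the last hour
kubectl logs my-pod -n my-namespace --since=1h

# Prefix each line with its timestamp
kubectl logs my-pod -n my-namespace --timestamps
```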
Step 4: Check Services and Networking
If pods are running but your app is unreachable:
kubectl get svc -n <namespace>
Look at the CLUSTER-IP and EXTERNAL-IP. Make sure they exist and look reasonable.
Check endpoints:
kubectl get endpoints -n <namespace>
A service with no endpoints means the selector isn't matching any pods. This is surprisingly common.
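The empty-endpoints case is almost always a label mismatch. In a working pair, the Service's selector must match the pod template's labels exactly; here is a minimal sketch with illustrative names:

```yaml
# Service -- selects pods labeled app: myapp
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp        # must match the pod labels below
  ports:
    - port: 80
      targetPort: 8080
---
# Deployment -- its pod template carries the matching label
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp    # if this differs, the Service has no endpoints
    spec:
      containers:
        - name: myapp
          image: myapp:1.0
```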
Test connectivity from inside the cluster:
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
Now try to reach the service from inside the pod. This tells you if the issue is external routing or internal networking.
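From inside the pod, plain HTTP tools are enough to test a service; whether wget or curl is available depends on the container image (service name and port here are placeholders):

```shell
# Inside the pod's shell -- the service name resolves via cluster DNS
wget -qO- http://myapp.my-namespace.svc.cluster.local:80

# or, if the image ships curl instead of wget:
curl -s http://myapp:80
```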
Real-World Example: Debugging a CrashLoopBackOff
Let's say you deploy an app and immediately see CrashLoopBackOff. Here's exactly what you do:
1. Get the pod status:
kubectl get pods -n production
You see your app is CrashLoopBackOff.
2. Describe the pod:
kubectl describe pod myapp-abc123 -n production
The Events section shows "Back-off restarting failed container".
3. Check the logs from the previous run:
kubectl logs myapp-abc123 --previous -n production
You see: "Error: Database connection refused".
4. Check if the database service exists:
kubectl get svc -n production | grep database
Nothing. The database service isn't running.
5. Deploy the database:
kubectl apply -f database.yaml -n production
6. Verify the app recovers:
kubectl get pods -n production
Your app should move to Running now.
That's the entire workflow. Top-down, methodical, and it works.
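For completeness, the database.yaml from step 5 could be as small as a Deployment plus a Service. Everything here (names, image, port) is illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database
spec:
  replicas: 1
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
---
apiVersion: v1
kind: Service
metadata:
  name: database
spec:
  selector:
    app: database
  ports:
    - port: 5432
```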
Essential Commands Reference
Cluster overview:
- kubectl cluster-info: General cluster information
- kubectl get nodes: List all nodes and their status
- kubectl describe node <name>: Detailed node info
Pod debugging:
- kubectl get pods -A: All pods in all namespaces
- kubectl describe pod <name>: Pod events and status
- kubectl logs <pod>: Container logs
- kubectl logs <pod> --previous: Previous run logs
- kubectl logs -f <pod>: Stream logs
- kubectl exec -it <pod> -- /bin/sh: Shell into a pod
Services and networking:
- kubectl get svc: List services
- kubectl get endpoints: Service endpoints
- kubectl port-forward <pod> 8080:8080: Forward local port to pod
Resource inspection:
- kubectl get all -n <namespace>: Everything in a namespace
- kubectl describe <resource-type> <name>: Details on any resource
- kubectl get events -n <namespace>: Recent cluster events
Common Mistakes to Avoid
Not specifying the namespace. kubectl defaults to the "default" namespace. Your pod might be in "production". Always pass -n, or set the namespace on your current context with kubectl config set-context --current --namespace=<namespace>.
Skipping the describe step. People jump straight to logs. The Events section in describe output often tells you the real problem before you even look at application logs.
Ignoring node-level issues. If multiple pods are failing in strange ways, check your nodes first. Disk full, out of memory, or kernel panics will affect everything.
Not checking the selector. Pods not being created? Service with no endpoints? Check if your labels match your selectors.
Assuming logs are the root cause. An error in logs is a symptom, not always the cause. A pod might crash because a dependency is missing, not because of code in the pod itself.
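Checking labels against selectors takes two commands; both flags are standard kubectl options (the namespace and label here are placeholders):

```shell
# See the labels each pod actually carries
kubectl get pods -n my-namespace --show-labels

# List only the pods matching the selector your Service uses
kubectl get pods -n my-namespace -l app=myapp
```

If the first command shows pods but the second returns nothing, you've found your mismatch.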
When to Escalate
If your cluster itself is unhealthy (nodes NotReady, api-server Unhealthy), you're looking at infrastructure or etcd problems. This needs a different level of troubleshooting.
If individual pods are hitting resource limits repeatedly (OOMKilled, CPU throttling), it's a capacity planning issue, not a bug.
If networking looks broken (endpoint discovery failing, services not accessible), check your CNI plugin and network policies.
The Bottom Line
Kubernetes troubleshooting isn't magic. It's a repeatable process: start at the cluster, work down to the namespace, then the pod, then the container. Use kubectl describe for context, logs for details, and always check the Events section first.
Next time something breaks at 3 AM, you'll know exactly where to look.
