OpenShift 4 Troubleshooting: Your Cluster is Broken, What's Next?

Monochrome illustration of a forlorn character sitting on the ground, holding a gun, with a scarf around the neck, overlooking a beach with footsteps leading away.

OpenShift 4 brings a lot of great new features around cluster operations.

Things like the container operating system, one-click cluster updates, and the Machine API make managing a Kubernetes cluster much less of a hassle.

But all of those technologies, which lie at the core of OpenShift 4, are quite new.

At some point, something will go wrong, and you will need to know how to troubleshoot your cluster issues.

In this article, I will give you a few tips on how to figure out what exactly happened to your OpenShift 4 cluster. We will start by examining the OpenShift Console.

The first thing you need to know is that OpenShift comes with a complete monitoring stack out of the box.

Screenshot of an OpenShift cluster management dashboard displaying cluster details, status, utilization charts, and recent activity logs.

Metrics from every cluster component are collected by a built-in, highly available Prometheus installation.

You can examine these metrics directly in the OpenShift Console. There are dozens of prebuilt dashboards. For example, here you can see many detailed metrics about how the etcd cluster is doing.

There are also metrics and dashboards for the complete cluster as well as for the individual nodes.

Screenshot of an OKD (OpenShift Kubernetes Distribution) metrics dashboard with various graphs such as RPC Rate, Active Streams, and other monitoring data.

The built-in dashboards are great, but there is also a Grafana installation available to you. Inside this Grafana, you have exactly the same dashboards, but you get a bit more flexibility in visualising those metrics.

Screenshot of a monitoring dashboard showing CPU and memory utilization graphs with options for different time ranges and refresh intervals, and a sidebar navigation menu.

Every part of the cluster is managed by Cluster Operators. You can find the list of those operators in the Cluster Settings section of the console.

Screenshot of a software interface for OpenShift cluster settings showing current version, update status, and cluster ID information.

In this list, you can see whether any of the operators has a problem. You can also jump to an individual operator, like I am doing here with the network operator.

Screenshot of the OpenShift web console showing the details of a cluster operator named 'network', with version info and status indicating it is available.
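The same operator status is also available from the command line. Here is a minimal sketch of a filter for the table printed by `oc get clusteroperators`. The helper name `unhealthy_operators` is made up for this example, and it assumes the usual column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE). An operator that has no version yet would shift the columns, so treat the result as a hint rather than a verdict.

```shell
# Print the names of cluster operators that are not fully healthy.
# Reads the table produced by `oc get clusteroperators` on stdin.
# Assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
unhealthy_operators() {
  # Skip the header row; healthy means Available=True, Progressing=False, Degraded=False.
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Usage (needs a logged-in oc client):
#   oc get clusteroperators | unhealthy_operators
```

An empty result means every operator reports Available, not Progressing, and not Degraded.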

Remember that each operator is just another container in your cluster, so you can see its logs and metrics just like those of any other container.

Talking about logs, you can always check them right here, in the console. The logs you see in the console are taken directly from the cluster node where the container is running.

Screenshot of a Kubernetes/OpenShift web console displaying logs for a network operator pod, with messages indicating updates and reconciliations.

Additionally, you can configure cluster logging. This gives you an Elasticsearch cluster and Kibana, so you can store logs for extended periods and visualise and analyse them more conveniently. There is also an option to forward all logs to an external log store of your choice.

A screenshot of OpenShift Container Platform documentation about cluster logging detailing types of logs such as application, infrastructure, and audit logs with navigation items on the left.

If you have a problem with individual cluster nodes, you can always check them in the Compute section of the console - there are metrics and the current state of each node and machine.

Some Kubernetes resources do not have a dedicated page in this console, but you can always find them in the Search section. The search form lists all the resource types available in the cluster, and you can inspect each resource in detail.

Screenshot of an OpenShift Container Platform web console displaying a list of machines with their corresponding status, node information, provider state, region, and availability zone.

Also, you have a list of the Custom Resource Definitions at the bottom of the menu. Many cluster configurations are stored as custom resources. Once again, you can see each of those resources in detail right here, in the console. Naturally, you can also do all of this with kubectl or the OpenShift CLI (oc).

Screenshot of a computer interface showing the YAML configuration for cluster networking settings in a Kubernetes management platform.
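For the CLI route, a small sketch like the following lists the Custom Resource Definitions that belong to one API group. The helper name `crds_in_group` is hypothetical, and `config.openshift.io` is used only as an example group.

```shell
# List the CRDs whose name ends with the given API group.
# `oc get crd -o name` prints one CRD per line, for example
#   customresourcedefinition.apiextensions.k8s.io/networks.config.openshift.io
crds_in_group() {
  local group="$1"
  # Keep only lines that end with ".<group>" (exact suffix match).
  oc get crd -o name |
    awk -v g=".${group}" 'substr($0, length($0) - length(g) + 1) == g'
}

# Usage:
#   crds_in_group config.openshift.io
#   oc get network.config.openshift.io cluster -o yaml   # the cluster networking config
```

The last command should print roughly the same YAML as the console view for the cluster networking configuration.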

There are many things you can see in the OpenShift console or with the CLI.

But sometimes your cluster is so broken that none of those options are available.

In this case, it helps to have some way to log in to the cluster nodes directly, with SSH or a tool like AWS Session Manager.

Once you are inside the node, you have most of the standard Linux utilities to investigate what's wrong with your cluster.

Screenshot of a terminal displaying the contents of the 'os-release' file from a Fedora CoreOS system, including version information and URLs for documentation and support.

You can check the network state with the ss utility, for example, or inspect the iptables rules. All the logs from each pod on this node are stored right there on the node, so you can inspect them and search through them.

Screenshot of a terminal with the command 'ss -tuln' executed, displaying a list of network connections and associated processes in a table format with columns for Netid, State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port, and Process.

Screenshot of a computer terminal displaying the output of the 'iptables -L' command showing firewall rules and configurations.
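To search through the pod logs on a node, a plain recursive grep is usually enough. The sketch below assumes the logs live under /var/log/pods, the usual kubelet location, which is not stated in the article. The helper name `grep_pod_logs` is made up for this example.

```shell
# Print the names of pod log files that contain a pattern.
# The second argument overrides the log directory (assumed default: /var/log/pods).
grep_pod_logs() {
  local pattern="$1"
  local log_root="${2:-/var/log/pods}"
  # -r: recurse into per-pod directories, -l: print file names only
  grep -rl -- "$pattern" "$log_root"
}

# Typical checks on the node (run as root):
#   ss -tuln                              # listening TCP/UDP sockets, numeric ports
#   iptables -L -n                        # rules of the filter table, numeric output
#   grep_pod_logs "connection refused"    # which pods logged this message?
```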

One thing you need to remember is that OpenShift uses CoreOS, so you don't have a normal package manager at your disposal. Instead, you can use the rpm-ostree command to check the latest deployment and see whether it matches your expectations.

A terminal window displaying the output of the 'rpm-ostree status' command with details of deployments for a system managed by 'machine-config-operator'.
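If you only want the deployment the node is currently booted into, you can filter the status output. This is a rough sketch: it assumes the booted deployment is the line marked with a leading `●` (or `*` in plain-text output), and the helper name `booted_deployment` is hypothetical.

```shell
# Print the origin of the booted deployment.
# Reads the output of `rpm-ostree status` on stdin.
booted_deployment() {
  # The booted entry starts with a marker field; other entries are only indented.
  awk '$1 == "●" || $1 == "*" { print $2 }'
}

# Usage on the node:
#   rpm-ostree status | booted_deployment
```

Comparing this value with what the machine-config-operator is supposed to have rolled out tells you whether the node picked up the latest OS update.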

Other than that, it's still your regular Linux file tree, so you can see all the configuration files in the /etc directory, for example if you need to check the container registry configuration used by cri-o.

Naturally, you can check the disk details and, overall, just navigate this node like a normal server. Just keep in mind that the actual filesystem is a bit special because of the way OSTree layers it.

Talking about cri-o, this will probably be your main candidate for troubleshooting. There is a crictl tool that works similarly to the Docker or Podman CLI.

You can get the list of all containers running on this node and then, for example, find the etcd container. You can then exec into this container and run etcdctl commands to figure out if it works fine and if all the etcd nodes are okay.

Screenshot of a computer terminal with a command line interface showing the execution of commands related to etcd, a distributed key-value store for shared configuration and service discovery.
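The steps above can be sketched as a small helper. It assumes the container is named exactly `etcd` and that etcdctl inside it can reach the cluster without extra flags. Depending on the OpenShift version, you may need to pass certificate options or use a dedicated etcdctl container instead. The function name `etcd_member_list` is made up for this example.

```shell
# Find the etcd container on this node and list the etcd members from inside it.
etcd_member_list() {
  local id
  # --name takes a regular expression; -q prints container IDs only.
  id="$(crictl ps --name '^etcd$' -q | head -n 1)"
  if [ -z "$id" ]; then
    echo "no running etcd container found on this node" >&2
    return 1
  fi
  crictl exec "$id" etcdctl member list -w table
}

# Usage on a control plane node (run as root):
#   etcd_member_list
```

Swapping `member list -w table` for `endpoint health` gives a quick health check of the etcd endpoint instead.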

Just remember that almost everything in OpenShift runs inside a container. On the node level, cri-o is the tool that manages those containers, so crictl should become a good friend of yours.

Finally, if you have lost the normal way to log in to the cluster and query the Kubernetes API, you can log in to a control plane node and use the kubeconfig from there to query the API.

The kubeconfig is available in the /var/lib/kubelet directory. Kubectl is already installed on each node, so you can just export the KUBECONFIG variable and start investigating the broken cluster's API as usual.

A terminal screen with a user executing commands related to Kubernetes, including navigating to the kubelet directory and running 'kubectl get pods' and 'kubectl get nodes' commands.
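As a sketch, the export step could look like this. The file name `kubeconfig` inside /var/lib/kubelet is an assumption (the article only names the directory), and `use_node_kubeconfig` is a hypothetical helper.

```shell
# Point kubectl at the kubeconfig stored on the node.
# Assumed default path: /var/lib/kubelet/kubeconfig
use_node_kubeconfig() {
  local cfg="${1:-/var/lib/kubelet/kubeconfig}"
  if [ ! -r "$cfg" ]; then
    echo "kubeconfig not readable: $cfg" >&2
    return 1
  fi
  export KUBECONFIG="$cfg"
}

# Usage on a control plane node (run as root):
#   use_node_kubeconfig && kubectl get nodes
#   kubectl get pods --all-namespaces
```

Keep in mind that this kubeconfig carries the node's own identity, so some API requests may be denied.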

I hope these tips will help you fix your cluster if something goes really wrong with it. Well, I also hope that nothing that bad will happen in the first place, but, as you know, hope is not a strategy, especially in Kubernetes cluster operations.


Here's the same article in video form for your convenience: