Is Google Cloud Anthos Service Mesh a Mess?
Today we are going to look at how Google Cloud approaches service mesh, and as we will see, it is not a simple story.
First, let's understand what a service mesh is.
To do that, let's imagine a microservice application. In a microservice architecture, complexity grows every time a new component is added. Imagine a typical application made up of 20 microservices where most of the components talk to each other.
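To make that growth concrete, here is a small Python sketch that counts the possible service-to-service links. The function name is only illustrative, and it assumes the worst case, where every service may talk to every other one.

```python
def possible_links(services: int) -> int:
    """Count distinct service pairs when every service can talk to every other."""
    if services < 0:
        raise ValueError("services must not be negative")
    # Each service can pair with every other one; divide by 2 so that
    # A<->B and B<->A are counted as a single link.
    return services * (services - 1) // 2


for n in (5, 10, 20):
    print(n, "services ->", possible_links(n), "possible links")
```

With 20 services that is already 190 possible pairs, each of which may need discovery, encryption, authorization, and monitoring.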
Every component needs to discover the other components living in the same ecosystem, or they won't be able to communicate with each other. Communication between components needs to be encrypted and, in most cases, authorized only for the correct source and destination. We need to be able to manage traffic, in some cases with blue-green or canary deployments, and we need an observability tool to understand what is happening in our environment.
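As a rough illustration of what canary traffic management means, the sketch below splits requests between two versions of a service by weight. This is not how any particular mesh implements it; the names `pick_version`, `v1`, and `v2` and the 90/10 split are assumptions made up for the example.

```python
import random
from collections import Counter
from typing import Mapping


def pick_version(weights: Mapping[str, int], rng: random.Random) -> str:
    """Choose a version at random, proportionally to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


# Hypothetical canary rollout: 90% of requests stay on v1, 10% go to v2.
canary_weights = {"v1": 90, "v2": 10}
rng = random.Random(42)  # seeded only so the demo is repeatable
counts = Counter(pick_version(canary_weights, rng) for _ in range(10_000))
print(counts)
```

In a real mesh this decision is made by the proxy for every request, and the weights are changed through configuration rather than by redeploying the application.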
None of this functionality comes built into a microservice environment; it has to be added. There are many tools for each of these needs, but most of them add more complexity, and in the end we finish with one tool per need. A service mesh gives us all of this functionality, and much more, in a single layer.
A service mesh has two parts: a control plane, which is the brain of the monster, and a data plane, which in most cases is a sidecar container that provides proxying, routing, load balancing, and observability. In most meshes this sidecar is Envoy, a general-purpose proxy, and traffic is redirected into it with iptables rules. Linkerd is an exception: it does not use Envoy but its own lightweight proxy written in Rust, built only for the mesh use case, which its maintainers report to be noticeably faster and lighter.
This article is about GCP, so we need to talk about Anthos, because Anthos Service Mesh is the official way to deploy a service mesh in a GKE Kubernetes cluster. Anthos is a managed platform that can run across different cloud providers, such as AWS, and on-premises. So, for example, if you want to run GKE clusters in your data center, in AWS, and in GCP, you can use Anthos.
So if we want a service mesh across our distributed clusters, we are forced to use Anthos Service Mesh, and with it Istio as the underlying software. That is a lot of things we cannot change.
In this environment there is one important point that brings some complexity, mostly because it is not documented. GKE offers its own data plane, GKE Dataplane V2, which can be enabled when we create a cluster. This data plane is built on eBPF and Cilium. Why is this important? Because most of what the Istio Envoy sidecar offers as a proxy and for observability (though not the other features listed above) can be handled by eBPF and Cilium much faster, without iptables in the middle and without a sidecar, because everything happens in the kernel.
Cilium currently offers its own service mesh based on a sidecarless model, with only one Envoy per node, deployed as a DaemonSet. With this model, the performance problems that come with running an Envoy sidecar in every pod largely disappear.
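The difference between the two models can be sketched with simple arithmetic: with sidecars there is one proxy per pod, while with the per-node model there is one proxy per node. The cluster sizes below are made-up numbers for illustration, not measurements, and the function names are hypothetical.

```python
def sidecar_proxies(pods: int) -> int:
    """Sidecar model: every meshed pod carries its own proxy."""
    return pods


def per_node_proxies(nodes: int) -> int:
    """Per-node model: one shared proxy on each node, DaemonSet style."""
    return nodes


# Hypothetical cluster: 10 nodes running 30 meshed pods each.
nodes, pods_per_node = 10, 30
pods = nodes * pods_per_node

print("sidecar model: ", sidecar_proxies(pods), "proxies")
print("per-node model:", per_node_proxies(nodes), "proxies")
```

Fewer proxies means less memory and CPU spent on the mesh itself, although a shared proxy also means that the workloads on a node no longer have isolated proxies.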
But because there is no documentation on this topic, we have no idea what happens inside our GCP distribution. And again, this matters: if you enable Dataplane V2 in your cluster and also enable Anthos Service Mesh, you end up with an Istio mesh running alongside eBPF and Cilium, in a configuration that is not documented.
In any case, apart from this undocumented point, what we get when we install Anthos Service Mesh is a simple and convenient way to use a service mesh in our GKE clusters. For example, Managed Anthos Service Mesh gives us a Google-managed control plane and an optional managed data plane that we simply configure.
At the end of the day, if you use microservices you are going to need some kind of mesh, and Anthos Service Mesh is not a bad place to start if you are on GCP.