Series "Service Mesh"
The Best Service Mesh: Linkerd vs Kuma vs Istio vs Consul Connect comparison + Cilium and OSM on top
This article is sponsored by Allianz Direct. Read about our partnership here.
In the previous article, we defined what a Service Mesh is and why you might need it. If you missed it, go and read it first - because in this article, we will compare some of the most popular and production-ready Service Mesh technologies.
We will also mention some emerging and promising Service Meshes, though, as they are not yet as mature and battle-tested as the others, we won’t dare to recommend them just yet.
Keep in mind that this article is valid as of the beginning of 2022. Technology evolves and changes really fast, so things might be very different by the time you read it.
Let’s start by defining the criteria that a Service Mesh must fulfil. This will allow us to build a shortlist of the service meshes we want to compare.
Criterion #1: Open Source
We are going to focus on open source offerings that are developed and supported in the open, by multiple companies and many individual contributors. This also rules out any cloud-vendor-specific services, like AWS App Mesh.
Criterion #2: Cover 3 Pillars of Service Mesh
There is no dictionary definition of a service mesh, but we will only look at the ones that provide at least some features in each of the Security, Observability and Traffic Management areas.
Criterion #3: Version 1.0 or above
If you need to choose a tool for your stack, it’s a good idea to stick with something that has existed for a couple of years and has been validated and tried by other people. Chances are higher that many critical bugs have already been fixed, and many of the features you might need are already implemented, either directly in the tool or by the community.
Criterion #4: Support multi-cluster
You rarely have just one Kubernetes cluster. For production, you might have multi-region clusters or clusters for different purposes, and it helps to have a unified layer of observability, security and traffic management for failover, debugging and end-to-end encryption.
Bonus points: Follows or attempts to follow Service Mesh Interface
Let’s stop for a second to talk about the Service Mesh Interface, or SMI. SMI attempts to define a specification for the common traffic management features of a service mesh. The idea is that instead of every Service Mesh providing its own custom resource for, let’s say, Traffic Splitting, they would all follow the SMI specification.
The potential benefit is that you can swap Service Mesh implementations without necessarily changing your whole stack. Naturally, it will never be quite that easy, but the concept is nice. SMI is quite fresh, but it goes in the same direction as other specifications in the Kubernetes world - like the Open Container Initiative, the Container Storage Interface, the Container Network Interface and so on.
We always support proper standards and specifications, and we give preference to the service meshes that take the lead in embracing and supporting SMI.
With the criteria defined, let’s look at the shortlist of the Service Mesh technologies that we’ve prepared, in no particular order:
- Linkerd
- Kuma
- Istio
- Consul Connect
We’ve also tested OpenServiceMesh, but it quickly became obvious that it’s not quite ready for prime time - which is fair enough, given that its 1.0 release hadn’t happened yet.
How we compared
This is how we tested each service mesh:
- We deployed a new, clean AWS EKS cluster, with the AWS ALB Controller taking care of the Ingress;
- We then deployed robot-shop - an open source application that consists of 11 microservices, including a couple of databases. We intentionally avoided any of the service mesh-provided demo applications, as we wanted to see how the mesh would behave in a setup that was not built to showcase it;
- Afterwards, we installed the service mesh and tried to connect the aforementioned microservices to the data plane;
- Once done, we tested various features of the mesh, including mTLS, traffic control, authorisation policies and so on.
Out of curiosity, we tried using a Service Mesh with AWS EKS Fargate - the serverless offering of EKS - but none of the meshes we tested was able to run proxies in Fargate pods, due to the security boundary enforced by Fargate.
We will look at each mesh from 5 different angles:
- Installation and configuration
- Data plane implementation
- mTLS and identity management
- Traffic control
- Observability
We will also take a brief look at Cilium, as a high-performance alternative to fully featured service meshes.
In the end, we will draw some conclusions about which service mesh is the best and the worst one.
We used robot-shop for the comparison, as well as our own custom tool called Service Messer.
Installation and configuration
We start by comparing the installation experience.
In most cases, mkdev’s recommendation is to use Helm for packaging and deploying Kubernetes applications, especially cluster-level components.
Almost every service mesh we tested, except Consul, provides a dedicated CLI tool to install the mesh. In the case of Linkerd, the CLI is the recommended installation method. We found this CLI easy to use, and it provides a set of useful features, like verifying that your cluster is compatible with Linkerd. The installer itself merely generates Kubernetes manifests that can then be applied to the cluster - we recommend storing those YAMLs in source control. You can also use the Linkerd Helm chart, although in that case you need to generate the identity certificates yourself.
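As a sketch, the typical Linkerd CLI flow looks something like this (exact flags may differ between Linkerd versions):

```shell
# Verify that the cluster meets Linkerd's requirements
linkerd check --pre

# Render the control plane manifests instead of applying them directly,
# so they can be reviewed and stored in source control
linkerd install > linkerd-control-plane.yaml
kubectl apply -f linkerd-control-plane.yaml

# Verify the installation afterwards
linkerd check
```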
If you choose to use the CLI, you won’t be able to easily switch to Helm later on.
Istio, in contrast, is surprisingly CLI-focused. The Helm chart is currently marked as alpha, while the Operator mode is not recommended for new installations. The only real choice you have is istioctl - which, like the Linkerd CLI, can generate manifests instead of instantly installing the mesh.
On the plus side, Istio provides installation profiles that let you select exactly which features you expect from Istio - there is, for example, a minimal profile that gives you just the core components of Istio.
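For illustration, selecting a profile with istioctl looks roughly like this - `manifest generate` is the variant that renders YAML for source control instead of applying it to the cluster:

```shell
# Install Istio with only the core components
istioctl install --set profile=minimal

# Or render the manifests for review and version control
istioctl manifest generate --set profile=minimal > istio-minimal.yaml
```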
Among all the meshes, we found Kuma to provide the widest range of installation options, starting from an excellent Helm chart and going all the way to a CloudFormation-based installation on top of AWS ECS, so that you can run the Kuma control plane outside of Kubernetes.
Finally, Consul installation is entirely focused on an excellent official Helm chart. The only problem is that in addition to Consul Connect, you have to install Consul itself - unless, of course, you already have a Consul cluster running outside of Kubernetes. Consul is not a Kubernetes-specific technology - it’s a separate, almost decade-old service discovery tool that was built to work on any infrastructure. It has its own key-value store and its own way of configuring things.
To conclude, every Service Mesh except Istio has proper Helm support and can be installed and updated like any other Kubernetes application. The quality of the charts varies from mesh to mesh, with Linkerd and Consul leading the way in the number of options you can configure and the quality of the documentation around those charts.
Connecting the mesh and the proxy model
When we install the mesh, we install the control plane, which is in charge of everything that happens in the mesh. The second part of the mesh is the data plane, and there are some differences in how the data plane is implemented in each mesh. Each mesh we looked at uses the sidecar proxy pattern - meaning all of your pods in the mesh need to run an extra container that takes over the network connections to your application. Every mesh we tested, except Linkerd, uses Envoy as the proxy.
Envoy itself is a separate open source project that you can run without any service mesh. Because Envoy was built to be a cloud native proxy, most service meshes rely on it to build their data plane. It also means that most service meshes share the same set of features that Envoy provides - and they depend on the Envoy project’s development cycle.
The only exception to this is Linkerd, which has its own sidecar proxy that, according to benchmarks, is many times faster than Envoy. The Linkerd proxy is intentionally small and focused, and it has far fewer features than Envoy. Consul Connect also uses Envoy as a sidecar proxy, but it additionally runs a DaemonSet with the Consul Agent. This means that in addition to an extra container next to each of your pods, Consul adds an extra pod on each of your cluster nodes. Once again, Consul is not a Kubernetes-specific tool. It’s a general-purpose, agent-based service discovery and service mesh technology that happens to support Kubernetes.
Each service mesh gives you three options to include the sidecar proxy and, thus, bring your service into the mesh:
- Automatically inject the proxy for every pod in the cluster (we would not recommend this approach for existing clusters, due to the potential impact);
- Automatically inject the proxy for every pod in a certain namespace or workload;
- Manually inject the proxy into the workloads you need.
For each mesh, you have to re-create your pods for the proxy to be injected.
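As a sketch, namespace-level automatic injection is typically enabled with a mesh-specific label or annotation - the keys below match the documented ones for each mesh, but may vary between versions:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: robot-shop
  labels:
    istio-injection: enabled            # Istio uses a namespace label
  annotations:
    linkerd.io/inject: enabled          # Linkerd uses an annotation
    kuma.io/sidecar-injection: enabled  # Kuma (annotation or label, depending on version)
```

After enabling injection, a `kubectl rollout restart` of the affected workloads re-creates the pods with the proxy container.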
mTLS and service identity - and ingress
Each Service Mesh we tested supports mTLS - meaning it gives your pods a certificate-based identity and encrypts all the traffic between meshed services in the cluster. Each mesh allows you to use an external PKI and provide your own certificates, and each mesh supports multiple sources of identity for your pods.
The default one in most cases is to use Kubernetes Service Accounts, but if you mesh services outside of Kubernetes, then Istio, for example, allows you to use another source of identity. It’s worth mentioning that Linkerd works only with Kubernetes, while Istio, Kuma and Consul Connect can be used on any infrastructure.
Each service mesh also allows you to configure Authorization Policies tied to the service identity - you can say that this service only accepts connections from that other service.
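To give an idea of what such a policy looks like, here is a sketch of an Istio AuthorizationPolicy that only allows pods running under a (hypothetical) `web` service account to call the `cart` service:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: cart-allow-web-only
  namespace: robot-shop
spec:
  selector:
    matchLabels:
      app: cart        # the policy applies to pods of the cart service
  action: ALLOW
  rules:
    - from:
        - source:
            # the identity comes from the Kubernetes Service Account, proven via mTLS
            principals: ["cluster.local/ns/robot-shop/sa/web"]
```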
The challenge with mTLS is that, unless you are in a greenfield environment, you are rarely able to make every single pod part of the mesh at once. But once a pod has mTLS enabled, it wants to validate and encrypt all incoming traffic. And while mTLS works transparently between already meshed pods, it gets tricky with external connections that are not part of the mesh - in simple words, you cannot have mutual TLS if one side of the connection does not have any certificates of its own.
There are two problems with this:
- Your Kubernetes cluster performs health checks against your pods, and your Kubernetes node is not part of the mesh. If you want to enforce mTLS, you need to make sure the mesh can handle this case;
- Your Ingress Controller, which gets external traffic into the cluster, might not run inside the cluster itself. For example, most public cloud ingress controllers simply provision a managed load balancer outside of the cluster - this load balancer cannot be part of the mesh, so it won’t be able to participate in mTLS.
Every mesh solves this problem differently.
Kuma does not have mTLS enabled by default at all, but once you enable it, none of the non-meshed components will be able to talk to the meshed ones. That is, unless you enable PERMISSIVE mode, in which case out-of-mesh services talk to the meshed ones over a plaintext connection - Permissive mode cannot be overridden per service.
Linkerd takes the simplest approach - it does not enforce mTLS at all, so any connection from the outside will simply not be encrypted by Linkerd - similar to Kuma’s permissive mode. You can enforce mTLS by using Linkerd Authorization Policies, but in that case any non-meshed connection will break, including native Kubernetes probes - unless you whitelist based on the source IP address.
Istio allows you to mark certain connections as Permissive, so that non-TLS connections are allowed. This is similar to Kuma’s permissive mode, but Istio goes further and lets you set Permissive mode for a particular service only.
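In Istio this is done with a PeerAuthentication resource; a sketch for a single (hypothetical) workload looks like this:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: cart-permissive
  namespace: robot-shop
spec:
  selector:
    matchLabels:
      app: cart   # only this workload accepts both plaintext and mTLS traffic
  mtls:
    mode: PERMISSIVE
```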
Still, Istio is meant to be used with its own Ingress Gateway. This Ingress Gateway sits between your meshed services and your external load balancer, so you can benefit from Istio features at the edge of your cluster.
Consul Connect uses the same approach, with the difference that it does not have any way to allow non-meshed connections. Instead, you are forced to use the Consul Ingress Gateway to get non-meshed traffic to your pods.
This is where one of the biggest differences between the meshes lies: Kuma and Linkerd let you use your existing method of getting traffic into the cluster, while Istio and Consul take over this part and force you to re-think how you handle ingress connections. The trade-off Kuma and Linkerd had to make is to avoid enforcing mTLS for all traffic, and reserve it for internal communication only.
Traffic control
When comparing the traffic control capabilities of the meshes, you first have to decide which exact traffic control features you need. The only feature present in every mesh is traffic splitting - and traffic splitting is arguably the one that will be used most by the majority of service mesh users.
Linkerd has the fewest traffic control features - besides traffic splitting, it also allows you to configure retries and timeouts for your services. Linkerd traffic splitting is implemented via the SMI TrafficSplit API, instead of a custom resource type of its own.
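A minimal SMI TrafficSplit, as used by Linkerd, might look like this - the service names are hypothetical, and the apiVersion and weight format depend on which SMI revision your Linkerd version supports:

```yaml
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: web-split
  namespace: robot-shop
spec:
  service: web        # the apex service that clients address
  backends:
    - service: web-v1 # the current version keeps 90% of the traffic
      weight: 900
    - service: web-v2 # the canary receives 10%
      weight: 100
```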
This is different from Consul, Kuma and Istio - each of those has its own custom resources for traffic control, and each provides much more flexibility in how to move traffic inside the mesh. They all, for example, allow you to route traffic based on the value of an HTTP header - something that Linkerd is not capable of right now.
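As an illustration of header-based routing, here is a sketch of an Istio VirtualService - it assumes a DestinationRule defining the `v1` and `v2` subsets, and the service names are made up:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment
  namespace: robot-shop
spec:
  hosts:
    - payment
  http:
    - match:
        - headers:
            x-canary:         # requests carrying this header go to v2
              exact: "true"
      route:
        - destination:
            host: payment
            subset: v2
    - route:                  # everything else goes to v1
        - destination:
            host: payment
            subset: v1
```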
If you have a strong requirement for advanced routing inside the mesh that goes beyond simple canary deployments, then Consul, Kuma and Istio are your best options. Keep in mind that each of those three can be quite complex to configure - you need to create multiple objects of different types, and despite the generally great documentation of each of these meshes, getting traffic control right is quite a complex task. We found that among those three meshes, Kuma’s custom resources are the easiest to comprehend and configure.
You should also remember that the more complex your traffic control configuration is, the bigger its impact on latency. If latency is critical, then your best bet right now is Linkerd - even though it lacks many features, it’s much more performant and lightweight than the standard Envoy-based service meshes.
Another thing to consider is that, over time, each service mesh will get more traffic control features. So, for example, if you do not have a hard requirement for header-based routing or circuit breaking right now, you can safely pick Linkerd, knowing that these features will most likely arrive there too.
A somewhat mind-bending detail of traffic splitting in a service mesh is that it happens on the client side, not on the server side. This means that traffic splitting will only work if the client is also part of the mesh. Keep this in mind when integrating a service mesh with existing applications, as the result might surprise you. To benefit from traffic splitting, for example, your load balancer has to be part of the mesh - that’s one of the reasons Consul and Istio have Ingress Gateways.
Observability
In simple terms, observability consists of two parts:
- Collecting as much data about your system as possible (including metrics, logs and traces);
- Tooling to easily explore this data, to be able to find out what’s going on in your system.
When it comes to data collection, all service meshes are more or less identical - each exposes a lot of metrics about everything that is happening between your services, each integrates with Prometheus and Grafana, and each works well with tracing systems like Zipkin and Jaeger.
Every mesh can be installed with a built-in monitoring stack, and every mesh recommends in its documentation rolling out your own monitoring stack for production deployments. This means that even though you can use the Prometheus or Grafana bundled with the mesh, you should really integrate your existing Prometheus and Grafana instead.
But providing the data is only one part of observability. The second part is the tooling to explore it. As already mentioned, you can use Grafana and tracing tools with each Service Mesh. What’s more interesting are the tools each mesh provides on top of that.
Istio has Kiali, a fantastic web UI to visually explore almost everything that is happening in your mesh, including all the connections between meshed and non-meshed components, with metrics and dashboards. The graph view of Kiali is especially great, as it has many configuration options and can be tweaked to show every possible detail of the existing connections.
Linkerd’s web dashboard is also pretty nice, though not as comprehensive as Kiali. It also has a graph overview, but it’s much simpler than the one in Kiali. On the other hand, Linkerd provides a unique “tap” feature that lets you hook into the live traffic of any meshed service and explore every connection in detail - this kind of debugging tool is essential for proper observability.
Both Consul and Kuma also come with their own dashboards, both very good looking, though not as useful in practice as Kiali and Linkerd.
The Consul dashboard can pull metrics directly from Prometheus, but it does not have a graph overview of all connections. It’s also confusing that it shows every service as connected to every other service, instead of visualising only the existing connections between them. On the positive side, there are direct links to the documentation for each particular Consul feature - plus, let’s not forget that this interface is not only for the service mesh: it serves as a general Consul dashboard.
Kuma’s dashboard surprised us with a nice data plane installation wizard and a generally nice design, but in the end it turned out to be a simple read-only table overview of your mesh. On top of this dashboard, though, Kuma gives you excellent Grafana dashboards, with a custom data source and a graph overview of each service, right inside Grafana.
eBPF and Cilium Mesh
Before we wrap up, we should mention Cilium. Cilium is not a service mesh in the traditional sense. Instead, Cilium is a container network plugin that uses eBPF instead of iptables. We won’t go into the details of what eBPF is, but in short, it allows software to run directly in the Linux kernel, which results in greater flexibility and performance gains. Tell us if you want to see a separate article on eBPF and Cilium.
Installing Cilium requires you to get rid of your existing CNI plugin. For example, if you have an EKS cluster with the AWS VPC CNI, you need to remove it first, then install Cilium and restart all pods. As you can imagine, such an operation can be quite dangerous on a production cluster - so we would only recommend using Cilium in new clusters.
Once it’s installed, you can configure NetworkPolicies with Cilium. These NetworkPolicies are quite simple compared to those of a proper service mesh - you can only say that pods with certain labels can or cannot access pods with other labels; there is no identity-based authorization.
But, because of eBPF, Cilium is capable of something that none of the service meshes can do: it understands what kind of traffic is going through it, and can enforce policy based on that traffic. So, for example, you can create a policy specific to Kafka or Cassandra traffic - and these policies will work in the kernel, via Cilium’s eBPF programs, and will be much faster than anything you might implement with Istio, Linkerd or any other mesh.
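Here is a sketch of such a protocol-aware policy, based on Cilium’s documented Kafka rules - the labels and topic name are made up:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: kafka-allow-produce-orders
spec:
  endpointSelector:
    matchLabels:
      app: kafka              # policy is enforced on the Kafka pods
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: orders-service
      toPorts:
        - ports:
            - port: "9092"
              protocol: TCP
          rules:
            kafka:
              - role: "produce"   # only producing is allowed...
                topic: "orders"   # ...and only to this topic
```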
eBPF also allows you to explore live traffic between every container, down to the TCP packet level. There is a separate web UI called Hubble, as well as CLI tooling, that can hook into the traffic and let you explore it in detail - a bit like the Linkerd Tap feature, but on a much lower level.
In general, Cilium in its standard form is more of a very advanced, high-performance networking tool, with the ability to configure traffic-specific firewall rules and low-level network observability utilities. But there is now also a Cilium Service Mesh, which combines Cilium’s network plugin with an Envoy proxy per cluster node. As of now, this is still in beta, but on Google Cloud Platform, for example, you can already use eBPF with Cilium and Istio together.
Cilium’s approach with eBPF and NetworkPolicies, even without the service mesh, gives you many of the features you might need from a mesh - at least around security via transparent IPsec encryption, observability, metrics and pod-to-pod firewalling.
You can combine Cilium with existing Service Meshes - for example, use it together with Linkerd and gain the benefits of both tools. But, as already mentioned, you should be careful installing Cilium on existing clusters, as it basically requires changing your container networking stack.
To sum it up
So which service mesh is the best? There is no clear winner across all categories, but there are some obvious winners in each of the categories we explored. And keep in mind that it’s impossible to compare every feature of every mesh, even if we look only at the five meshes we took for this comparison.
Still, let us provide our conclusions from this comparison.
Consul Connect is probably the most mature, simply because of Consul. Consul is a decade-old, polished technology, battle-tested in huge production environments. It’s a safe choice in terms of stability and features. There is only one problem with Consul Connect: it’s not a purpose-built cloud native service mesh, but rather a service mesh built on top of Consul. You have to run the Consul agent in addition to the Envoy proxy, you have to maintain a Consul cluster, and every Kubernetes resource specific to Consul Connect ends up as a Consul configuration object, stored in the Consul key-value store - creating an additional level of indirection. Consul Connect also proved to be the hardest one to integrate with existing Ingress Controllers.
Istio, popularity-wise, is the current leader in the service mesh space. Feature-wise, it’s the most powerful and advanced mesh. You can do many things with it, even if those things are tricky to configure. There is a big community around Istio, and you have native integrations with Google Kubernetes Engine and OpenShift. Installing and using Istio is, complexity-wise, like installing a Kubernetes cluster on top of a Kubernetes cluster - you get another few dozen building blocks to learn, but once you do, you can achieve almost anything - except, of course, low latency and low resource consumption.
Linkerd is the most performant and focused service mesh out there. It’s intentionally small, with a carefully selected feature set, a convenient dashboard and a strong focus on doing just enough. Linkerd was the first service mesh, and the engineering team behind it has learned a thing or two about how a service mesh should look and work - it’s a pure joy to read their articles about how the technical decisions for this mesh were made. Linkerd is also the easiest service mesh to get started with and to integrate into existing clusters. But Linkerd can only be used with Kubernetes, so if you want to mesh together traditional infrastructure and Kubernetes, you are out of luck.
And, finally, Kuma is one of the newest service meshes, aimed at fixing many of the mistakes other meshes made. It’s very well thought out, well documented, and feature-wise sits somewhere between Linkerd and Istio. Kuma claims to be a universal service mesh, meaning it was built to accommodate both Kubernetes and traditional infrastructure in a single mesh - not unlike Istio or Consul. It also has an enterprise version with even more functionality on top. We were pleasantly surprised both by Kuma’s feature set and by how easy it is to use.
Choosing a service mesh highly depends on your environment. Regardless of which mesh you choose, it’s a big, complex technology that becomes essential part of your stack. You should always start with the list of existing and potential future requirements, and select the mesh based on those.
Picking a tool only because it’s the most powerful or because it has the best dashboard would not be the right approach in the case of a service mesh. You might need to create a single mesh spanning multiple regions, data centres, clusters and thousands of virtual machines - and then the only good choice might be Consul or Kuma. Or you might just need a mesh for your existing Kubernetes cluster, to enhance, but not overcomplicate, your environment - and in that case, Linkerd is the tool for the job.
If your company needs help selecting and integrating service mesh, please contact us via our website - we will schedule a call to discuss your needs and how we can help you with them.
There are many more meshes that we did not explore, like Traefik Mesh, and cloud-specific meshes - each with their own valid use cases. Please tell us in the comments which service mesh we should make more videos about, and which important points we might have missed in this comparison.
Here's the same article in video form for your convenience: