<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: t.v.vignesh</title>
    <description>The latest articles on DEV Community by t.v.vignesh (@tvvignesh).</description>
    <link>https://dev.to/tvvignesh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F202479%2F865ebacc-5c40-4eb3-8ea6-7be6e19257f2.jpg</url>
      <title>DEV Community: t.v.vignesh</title>
      <link>https://dev.to/tvvignesh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tvvignesh"/>
    <language>en</language>
    <item>
      <title>Infrastructure Engineering - Deployment Strategies</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Sat, 13 Feb 2021 09:38:52 +0000</pubDate>
      <link>https://dev.to/timecampus/infrastructure-deployment-strategies-2h0k</link>
      <guid>https://dev.to/timecampus/infrastructure-deployment-strategies-2h0k</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is a part of a series on Kubernetes and its ecosystem where we will dive deep into the infrastructure one piece at a time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now that we have answered the basic questions about Kubernetes and looked at the various architectural patterns you can opt for, the next piece of the puzzle is figuring out the deployment strategy that is the right fit for you. And that is what we are going to discuss in this blog post.&lt;/p&gt;

&lt;p&gt;You might have seen people talk about Private Cloud, Public Cloud, On-Premise and Hybrid Cloud deployments. But with Kubernetes, a lot of these differences fade away, since most of them relate not to Kubernetes itself but to the infrastructure supporting the Kubernetes deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Public Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a public cloud deployment, the cloud provider takes care of almost everything around your Kubernetes cluster, giving you near-unlimited scalability, minimal maintenance and lower costs since you share the underlying resources with other tenants. This makes it a great option for businesses that do not host highly confidential data and can run on shared infrastructure.&lt;/p&gt;

&lt;p&gt;Ultimately, the public cloud is all about a shared security model, with both the cloud provider and the users playing significant roles. You can read the &lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/exploring-container-security-the-shared-responsibility-model-in-gke-container-security-shared-responsibility-model-gke"&gt;GKE Shared Security Model&lt;/a&gt; or &lt;a href="https://services.google.com/fh/files/misc/gcp_pci_srm__apr_2019.pdf"&gt;GCP Shared Responsibility Model&lt;/a&gt;, &lt;a href="https://aws.amazon.com/compliance/shared-responsibility-model/"&gt;AWS Shared Responsibility Model&lt;/a&gt; and &lt;a href="https://docs.microsoft.com/en-us/azure/security/fundamentals/shared-responsibility"&gt;Azure Shared Responsibility Model&lt;/a&gt; to learn which responsibilities each cloud provider takes on and which are offloaded to you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Private Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While a public cloud typically has soft partitions in place, sharing resources between multiple tenants has often been viewed as a security concern by some organizations. Sectors like banking/finance, health and the military also have strict regulations on where and how you host data, along with the various data localization laws that govern each region.&lt;/p&gt;

&lt;p&gt;In such cases, a private cloud can give you more isolation and control over all resources (hardware and software) while letting the cloud provider manage the resources for your tenant. It is like your own private data center in the cloud. While this adds a significant overhead to the pricing and extra pieces for a dedicated DevOps team to manage, it can be worth it compared with managing everything on-premise, while also catering to workload elasticity.&lt;/p&gt;

&lt;p&gt;Not all cloud providers support a private cloud (remember that a Virtual Private Cloud and a Private Cloud are &lt;a href="https://www.secure-24.com/blog/virtual-private-cloud-vs-private-cloud-whats-the-difference/"&gt;quite different&lt;/a&gt;), and few providers offer a private cloud that is not a VPC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtual Private Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the typical go-to option when you want to opt for a private cloud. A Virtual Private Cloud (VPC) is essentially a private cloud running on public cloud infrastructure, with tenants separated by different subnets, private IP ranges and peering, simulating a dedicated environment (while the underlying infrastructure is still shared).&lt;/p&gt;

&lt;p&gt;This fits most use cases that require regulatory compliance, isolating data transmission, processing and storage in a private environment, while costing about the same as a public cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-Premise&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While there has been huge cloud adoption over the years, on-premise systems still have their place. They are used to maintain the highest level of isolation, with the business running its own infrastructure, network, security, DNS and so on. This gives the business complete control over all the infrastructure in use, reduces recurring costs, allows use-case specific network optimizations, and keeps things running even during global failures at the cloud providers, making it a good bet if you work with huge amounts of compute and data over long periods of time and have the resources to run on-premise datacenters. But do note that on-premise is &lt;a href="https://platform9.com/blog/kubernetes-on-premises-why-and-how/"&gt;not without challenges&lt;/a&gt;, and it is always better to have a cloud strategy to fall back on while trying to do it all on-premise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Deployments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While there are a lot of deployment strategies in place, hybrid cloud stands out since it allows you to use multiple deployment models or cloud providers depending on your needs and make the complete system work together as one (e.g. you can use an on-premise deployment for regulated workloads and a public cloud deployment for the rest, or use GCP in the US and India while opting for AWS, Azure or Alibaba in China).&lt;/p&gt;

&lt;p&gt;This is made possible by Kubernetes being a standard, portable platform across cloud providers, by the ability to manage infrastructure as code, by the ability to set up networking between clusters whenever needed with the help of &lt;a href="https://linkerd.io/2/features/multicluster/"&gt;multi-cluster service meshes&lt;/a&gt;, and by the ability to orchestrate the deployments using &lt;a href="https://github.com/kubernetes-sigs/kubefed"&gt;Kubefed&lt;/a&gt; or &lt;a href="https://crossplane.io/"&gt;Crossplane&lt;/a&gt;.&lt;/p&gt;
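
&lt;p&gt;As a rough sketch of what such orchestration looks like (assuming KubeFed is installed and the member clusters are already joined; the cluster and image names here are placeholders), a federated resource lets you place the same workload on clusters from different providers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: my-app
  namespace: my-namespace
spec:
  template:
    # A regular Deployment spec, propagated to every placed cluster
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: nginx:1.19
  placement:
    # One cluster per provider - KubeFed keeps both in sync
    clusters:
    - name: gke-us-central1
    - name: eks-us-east-1
&lt;/code&gt;&lt;/pre&gt;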

&lt;p&gt;There have recently been new proprietary options in the market to enable such hybrid cloud deployments, with services like &lt;a href="https://cloud.google.com/anthos"&gt;Google Anthos&lt;/a&gt;, &lt;a href="https://azure.microsoft.com/en-in/overview/azure-stack/"&gt;Azure Stack&lt;/a&gt; and &lt;a href="https://aws.amazon.com/outposts/"&gt;AWS Outposts&lt;/a&gt; in place for enterprises that are looking to start this journey with most of the heavy lifting done by the cloud providers. But do watch out for the pricing, since it can turn out to be costly over long periods.&lt;/p&gt;

&lt;p&gt;Hybrid deployments have to be done with great care since they add a lot of complexity to the infrastructure you have to manage, and pricing needs to be kept in mind as well (e.g. cross-region network calls can end up costing quite a lot).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workload Portability&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;Thinking of hybrid deployments brings us to workload portability, because unless your workload is portable, hybrid deployment strategies may not be feasible. This also means that you have to reduce the dependence on proprietary services from your cloud providers as much as possible, since you might otherwise end up making cross-cloud or cross-region API calls whenever your other cloud provider or on-premise systems don't support those services. Sometimes you might even have to build abstractions within your applications, since the same kind of service does not always expose the same APIs across cloud providers, adding more complexity especially with hybrid architectures; or you might use something like &lt;a href="https://crossplane.io/"&gt;Crossplane&lt;/a&gt; to enable this for you to some extent.&lt;/p&gt;

&lt;p&gt;But if none of these are an issue, then containers and an orchestration system like Kubernetes can take care of workload portability, especially with &lt;a href="https://opencontainers.org/"&gt;OCI&lt;/a&gt; now in place for containers and &lt;a href="https://github.com/container-storage-interface/spec/blob/master/spec.md"&gt;CSI&lt;/a&gt;, &lt;a href="https://github.com/containernetworking/cni"&gt;CNI&lt;/a&gt;, &lt;a href="https://github.com/kubernetes/cri-api"&gt;CRI&lt;/a&gt; and &lt;a href="https://smi-spec.io/"&gt;SMI&lt;/a&gt; for storage, networking, runtime and service mesh respectively, creating a healthy standards-based ecosystem and enabling workload portability without lock-in. For a workload to be truly portable, all the underlying resources should be portable with no (or very limited) changes.&lt;/p&gt;

&lt;p&gt;While Kubernetes constructs like Pods and Deployments don't lock you in to a particular provider, you have to take into account the underlying infrastructure from the cloud provider (storage, compute and networking), which can sometimes affect the way Kubernetes runs its workloads across providers.&lt;/p&gt;
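
&lt;p&gt;For instance, a plain Deployment like the sketch below runs unchanged on any conformant cluster; it is typically the storage and networking underneath (commented here) where provider differences surface:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.19
        ports:
        - containerPort: 80
# The spec above is portable as-is; a PersistentVolumeClaim's
# storageClassName, by contrast, maps to provider-specific disks
# (Persistent Disk on GKE, EBS on EKS, Azure Disk on AKS).
&lt;/code&gt;&lt;/pre&gt;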

&lt;p&gt;&lt;strong&gt;The Best Practices&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are on your way to choosing a cloud provider, do make sure that you check out their best-practices documentation, which can really help you understand how to organize and manage your resources to reap the maximum benefit with the future in mind. For instance, &lt;a href="https://cloud.google.com/docs/enterprise/best-practices-for-enterprise-organizations"&gt;use this&lt;/a&gt; for Google Cloud, &lt;a href="https://d1.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf"&gt;this&lt;/a&gt; for AWS, &lt;a href="https://docs.microsoft.com/en-us/azure/architecture/"&gt;this&lt;/a&gt; for Azure and &lt;a href="https://www.vmware.com/pdf/vmware-validated-design-20-reference-architecture-guide.pdf"&gt;this&lt;/a&gt; for VMware, and there are a lot of case studies which can help as well as you start your journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bare Metal, Virtual Machines, Containers and Serverless&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next question which you might have is, what is the best unit of deployment for me? Should I go bare metal or virtual machines or containers or serverless? This depends completely on your use case and the degree of abstraction and control you want over your infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bare Metal: Bare metal servers have no &lt;a href="https://en.wikipedia.org/wiki/Hypervisor"&gt;hypervisor&lt;/a&gt; on top, making them single-tenant and giving you complete control over storage, networking and compute. You get the benefit of more storage, faster deploy times, faster speeds and efficient container deployments, since you don't have to deal with VMs, which can be a significant overhead on top of the host operating system. But this also means that you end up opting for dedicated instances from your cloud provider, which can cost more since you have to account in advance for the elasticity you may need, and that you have to manage everything by yourself (your cloud provider is completely out of the picture once the bare metal server is in place), which can add operational overhead if you don't have the right team and tools.&lt;/li&gt;
&lt;li&gt;Virtual Machines: Virtual machines made multitenancy possible, reducing costs for users since the hardware is shared as needed. VMs also give you the ability to scale up/down whenever needed by adding more VMs and load balancing between them, but this is not as elegant as containers or serverless: every VM you add comes with an operating system to take care of, which can become an operational nightmare during patches/upgrades, can add huge licensing costs if you go for a proprietary OS, and is less efficient than containers, which share the underlying compute, storage and networking better, letting you spin up more containers for the same cost you spend on a VM.&lt;/li&gt;
&lt;li&gt;Containers: Containers have brought about a revolution in DevSecOps and infrastructure, with Docker pioneering the movement and making it accessible to all (even though the underlying technology was in use before). Containers make it possible to isolate your workloads without having to manage new virtual machines, have consistent/reproducible deployments across multiple environments, scale efficiently and drastically reduce licensing costs. Adding an orchestration system like Kubernetes or Swarm makes them even more powerful, giving us the ability to treat containers like cattle and cater to all kinds of failures you can have in a typical distributed system. This has truly changed the way we operate today, but it does require significant tooling to be in place to work properly.&lt;/li&gt;
&lt;li&gt;Serverless: Serverless has long been seen as the final step to elastic computing. And while it seems ambitious, it cannot completely replace containers, virtual machines or bare metal deployments; it is better seen as a great complement to them, considering its significant limitations.
When you want to go serverless, you have to take a few things into account: the cold/warm/hot start of serverless functions, since that decides the latency of the responses you are going to get, and the fact that every cloud provider has an execution timeout, like &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html"&gt;15 minutes in case of AWS&lt;/a&gt;, &lt;a href="https://cloud.google.com/functions/docs/concepts/exec#timeout"&gt;9 minutes in case of Google Cloud Functions&lt;/a&gt;, &lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale"&gt;10 minutes for Azure Functions&lt;/a&gt; and so on, making it unsuitable for long-running jobs. In addition, there are restrictions on the programming languages you can use in your serverless function (unless you opt for a container-based deployment, which essentially makes it a container-based deployment 🤔). If you still want to use serverless for long-running jobs, you might have to reach out for dedicated/premium plans or maintain your own serverless infrastructure within your Kubernetes cluster using something like &lt;a href="https://knative.dev/"&gt;Knative&lt;/a&gt;, &lt;a href="https://www.openfaas.com/"&gt;OpenFaaS&lt;/a&gt;, &lt;a href="https://kubeless.io/"&gt;Kubeless&lt;/a&gt; or similar, setting your own limits.&lt;/li&gt;
&lt;/ul&gt;
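
&lt;p&gt;To illustrate that last point, here is a minimal sketch of a Knative Service (assuming Knative Serving is installed in your cluster; the name and image are placeholders) where you set your own per-request timeout instead of living with your provider's limit:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: long-job
spec:
  template:
    spec:
      # Your own execution limit in seconds, instead of the
      # provider-imposed caps of typical managed FaaS offerings
      timeoutSeconds: 1800
      containers:
      - image: gcr.io/my-project/long-job:latest
&lt;/code&gt;&lt;/pre&gt;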

&lt;p&gt;As you might already realize from what we have discussed, the best way forward is to choose the right strategy for your use case, embracing hybrid strategies where needed, so that you keep all the factors including performance, scalability, security, usability and cost in check.&lt;/p&gt;

&lt;p&gt;Looking for help or engineering consultancy? Feel free to reach out to me &lt;a href="https://twitter.com/techahoy"&gt;@techahoy&lt;/a&gt; or via &lt;a href="https://www.linkedin.com/in/tvvignesh"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And if this helped, do share this with your friends, hang around and follow us for more like this every week. See you all soon.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>cloudnative</category>
      <category>docker</category>
    </item>
    <item>
      <title>Infrastructure Engineering - Architecting your Cloud Native Infrastructure</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Sat, 13 Feb 2021 07:55:04 +0000</pubDate>
      <link>https://dev.to/timecampus/architecting-your-cloud-native-infrastructure-3npk</link>
      <guid>https://dev.to/timecampus/architecting-your-cloud-native-infrastructure-3npk</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is a part of a series on Kubernetes and its ecosystem where we will dive deep into the infrastructure one piece at a time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the last blog, we explored the various questions one might have when starting off with Kubernetes and its ecosystem and did our best to answer them. Now that we have hopefully cleared the clouded thoughts you may have had, let us dive into the next important step in our journey with Kubernetes and the infrastructure as a whole.&lt;/p&gt;

&lt;p&gt;In this blog, we will look at the best possible way to architect your infrastructure for your use case and the various decisions you may want to take depending on your constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Quoting from one of our &lt;a href="https://dev.to/timecampus/graphql-diving-deep-4hnm"&gt;previous blog posts&lt;/a&gt;,&lt;/p&gt;

&lt;p&gt;"Your architecture hugely revolves around your use case and you have to be very careful in getting it right and take proper consultation if needed from experts. While it is very important to get it right before you start, mistakes can happen, and with a lot of research happening these days, you can often find any revolution happen any day which can make your old way of thinking obsolete.&lt;/p&gt;

&lt;p&gt;That is why, I would highly recommend you to &lt;strong&gt;Architect for Change&lt;/strong&gt; and make your architecture as &lt;strong&gt;Modular&lt;/strong&gt; as possible so that you have the flexibility to do incremental changes in the future if needed."&lt;/p&gt;

&lt;p&gt;Let's see how we would realize our goal of architecting our system considering a client-server model in mind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Entrypoint - DNS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In any typical infrastructure (cloud native or not), a request first has to be resolved by a DNS server, which returns the IP address of the appropriate server. This is where organizations like &lt;a href="https://www.iana.org/"&gt;IANA&lt;/a&gt; and &lt;a href="https://www.icann.org/"&gt;ICANN&lt;/a&gt; play a major role, resolving the TLDs as needed through the various &lt;a href="https://en.wikipedia.org/wiki/Regional_Internet_registry"&gt;RIRs&lt;/a&gt; (Regional Internet Registries), from where resolution gets routed to the appropriate registrar (say GoDaddy, BigRock, Google Domains, Namecheap, etc.), with organizations like &lt;a href="https://www.ietf.org/"&gt;IETF&lt;/a&gt; defining the protocols and standards that make the internet work.&lt;/p&gt;

&lt;p&gt;How you set up your DNS should be based on the availability you require. If you require higher availability, you may want to distribute your servers across multiple regions or cloud providers, depending on the level of availability you would like to achieve, and configure your DNS records accordingly to support that.&lt;/p&gt;

&lt;p&gt;If you would like to know more about IANA, I would recommend you to watch this video: &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Lk5j25nmZKY"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Or this video from Eli&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/MIgFfV-prpA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content Delivery Network (CDN)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In some cases, you might want to serve users with as little latency as possible while also reducing the load on your servers by pushing a major portion of the traffic to the edge. This is where a CDN plays a major role.&lt;/p&gt;

&lt;p&gt;Does the client frequently request a set of static assets from the server? Are you aiming to improve the speed of delivery of content to your users while also reducing the load on your servers?&lt;/p&gt;

&lt;p&gt;In such cases, a CDN at edge with a TTL serving a set of static assets based on constraints might actually help to both reduce the latency for users and load on your servers.&lt;/p&gt;

&lt;p&gt;Is all your content dynamic? Are you fine with serving content to users with some level of latency in favor of reduced complexity? Or is your app receiving low traffic?&lt;/p&gt;

&lt;p&gt;In such cases, a CDN might not make much sense, and you can send all the traffic directly to the Global Load Balancer. But do note that having a CDN also has the advantage of distributing the traffic, which can be helpful in the event of DDoS attacks on your servers.&lt;/p&gt;

&lt;p&gt;A lot of third-party providers offer CDN services, including &lt;a href="https://www.cloudflare.com/cdn"&gt;Cloudflare CDN&lt;/a&gt;, &lt;a href="https://www.fastly.com/"&gt;Fastly&lt;/a&gt;, &lt;a href="https://www.akamai.com/uk/en/cdn/"&gt;Akamai CDN&lt;/a&gt; and &lt;a href="https://www.stackpath.com/maxcdn/"&gt;StackPath&lt;/a&gt;, but there is a high chance that your cloud provider offers a CDN service as well, like &lt;a href="https://cloud.google.com/cdn"&gt;Cloud CDN&lt;/a&gt; from GCP, &lt;a href="https://aws.amazon.com/cloudfront/"&gt;CloudFront&lt;/a&gt; from Amazon, &lt;a href="https://docs.microsoft.com/en-us/azure/cdn/"&gt;Azure CDN&lt;/a&gt; from Azure, and the list goes on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zpuaPC8a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pd8pobvjny3z3nz5q33c.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zpuaPC8a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pd8pobvjny3z3nz5q33c.PNG" alt="Edge Network"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load Balancers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a request cannot be served by your CDN, it enters your Load Balancer. Load balancers can be either regional with regional IPs or global with &lt;a href="https://en.wikipedia.org/wiki/Anycast"&gt;Anycast IPs&lt;/a&gt;, and in some cases you can also use them to manage internal traffic.&lt;/p&gt;

&lt;p&gt;Apart from routing and proxying the traffic to the appropriate backend service, the load balancer can also take care of responsibilities like &lt;a href="https://en.wikipedia.org/wiki/TLS_termination_proxy"&gt;SSL termination&lt;/a&gt;, integrating with the CDN and more, making it an essential part of managing network traffic.&lt;/p&gt;

&lt;p&gt;While Hardware Load Balancers do exist, Software Load Balancers have been taking the lead thus far providing greater flexibility, cost reduction and scalability. &lt;/p&gt;

&lt;p&gt;Similar to CDNs, your cloud provider should be able to provide a load balancer for you (such as &lt;a href="https://cloud.google.com/load-balancing"&gt;GLB&lt;/a&gt; for GCP, &lt;a href="https://aws.amazon.com/elasticloadbalancing/"&gt;ELB&lt;/a&gt; for AWS, &lt;a href="https://docs.microsoft.com/en-us/azure/load-balancer/"&gt;ALB&lt;/a&gt; for Azure, etc.), but what is more interesting is that you can provision these load balancers directly from Kubernetes constructs. For instance, creating an ingress in GKE (aka GKE ingress) also creates a GLB for you behind the scenes to receive the traffic, and other features like CDN, SSL redirects, etc. can also be set up just by configuring your ingress as seen &lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-features"&gt;here&lt;/a&gt;.&lt;/p&gt;
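
&lt;p&gt;As a small sketch of this (the host and service names are placeholders), an ingress like the one below is enough for GKE to provision a GLB behind the scenes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    # Ask GKE for an external HTTP(S) load balancer
    kubernetes.io/ingress.class: gce
spec:
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80
&lt;/code&gt;&lt;/pre&gt;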

&lt;p&gt;While you should always start small, load balancers would allow you to scale incrementally having architectures like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7l7s8EO8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mpwi1rjoyxfozj7sdfky.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7l7s8EO8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mpwi1rjoyxfozj7sdfky.PNG" alt="Load Balancing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Networking &amp;amp; Security Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next important thing to take care of in your architecture is the networking itself. You may want to go for a private cluster if you want to increase security by moderating the inbound and outbound traffic, mask IP addresses behind &lt;a href="https://en.wikipedia.org/wiki/Network_address_translation"&gt;NATs&lt;/a&gt;, isolate networks with multiple &lt;a href="https://en.wikipedia.org/wiki/Subnetwork"&gt;subnets&lt;/a&gt; across multiple &lt;a href="https://en.wikipedia.org/wiki/Virtual_private_cloud"&gt;VPCs&lt;/a&gt; and so on leading to a controlled environment which can possibly prevent security concerns in the future.&lt;/p&gt;

&lt;p&gt;How you setup your network would typically depend on the degree of flexibility you are looking for and how you are going to achieve it. Setting up the right networking is all about reducing the attack surface as much as possible while still allowing for regular operations.&lt;/p&gt;

&lt;p&gt;Protecting your infrastructure by setting up the right network also involves setting up firewalls with the right rules and restrictions so that you allow only the traffic as allowed to/from the respective backend services both inbound and outbound.&lt;/p&gt;

&lt;p&gt;In many cases, these private clusters can be protected by setting up &lt;a href="https://en.wikipedia.org/wiki/Bastion_host"&gt;Bastion Hosts&lt;/a&gt; and tunneling through them for doing all the operations in the cluster since all you have to expose to the public network is the Bastion (aka Jump host) which is typically setup in the same network as the cluster.&lt;/p&gt;

&lt;p&gt;Some cloud providers also provide custom solutions in their approach towards &lt;a href="https://cloud.google.com/beyondcorp"&gt;Zero Trust Security&lt;/a&gt;. For instance, GCP provides its users with &lt;a href="https://cloud.google.com/iap"&gt;Identity-Aware Proxy&lt;/a&gt; (IAP), which can be used instead of typical &lt;a href="https://en.wikipedia.org/wiki/Virtual_private_network"&gt;VPN&lt;/a&gt; implementations.&lt;/p&gt;

&lt;p&gt;Now, while all these may not be required when you start off your journey architecting with Kubernetes, it is good to be aware of all these so that you can incrementally adopt these as and when needed.&lt;/p&gt;

&lt;p&gt;Once all of these are taken care of, the next step to networking would be setting up the networking within the cluster itself depending on your use case.&lt;/p&gt;

&lt;p&gt;It can involve things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up the service discovery within the cluster (which is handled by default by &lt;a href="https://coredns.io/"&gt;CoreDNS&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Setting up a &lt;a href="https://smi-spec.io/"&gt;service mesh&lt;/a&gt; if needed (eg. &lt;a href="https://linkerd.io/"&gt;LinkerD&lt;/a&gt;, &lt;a href="https://istio.io/"&gt;Istio&lt;/a&gt;, &lt;a href="https://www.consul.io/"&gt;Consul&lt;/a&gt;, etc.)&lt;/li&gt;
&lt;li&gt;Setting up &lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/"&gt;Ingress controllers&lt;/a&gt; and API Gateways (eg. &lt;a href="https://kubernetes.github.io/ingress-nginx/"&gt;Nginx&lt;/a&gt;, &lt;a href="https://www.getambassador.io/"&gt;Ambassador&lt;/a&gt;, &lt;a href="https://konghq.com/solutions/kubernetes-ingress/"&gt;Kong&lt;/a&gt;, &lt;a href="https://docs.solo.io/gloo-edge/latest/installation/ingress/"&gt;Gloo&lt;/a&gt;, etc.)&lt;/li&gt;
&lt;li&gt;Setting up network plugins using &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/"&gt;CNI&lt;/a&gt; facilitating networking within the cluster.&lt;/li&gt;
&lt;li&gt;Setting up &lt;a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/"&gt;Network Policies&lt;/a&gt; moderating the inter-service communication and exposing the services as needed using the various &lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types"&gt;service types&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Setting up inter-service communication between various services using protocols and tools like &lt;a href="https://grpc.io/"&gt;gRPC&lt;/a&gt;, &lt;a href="https://thrift.apache.org/"&gt;Thrift&lt;/a&gt; or &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP"&gt;HTTP&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Setting up A/B testing which can be easier if you use a service mesh like &lt;a href="https://istio.io/latest/docs/concepts/traffic-management/"&gt;Istio&lt;/a&gt; or &lt;a href="https://linkerd.io/2/features/traffic-split/"&gt;Linkerd&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
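
&lt;p&gt;As a sketch of the network-policy point above (the labels and port are placeholders, and the policy only takes effect if your CNI plugin enforces NetworkPolicy), this allows only the frontend pods to reach the backend on its service port:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend        # the pods being protected
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend   # the only pods allowed in
    ports:
    - protocol: TCP
      port: 8080
&lt;/code&gt;&lt;/pre&gt;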

&lt;p&gt;If you would like to look at some sample implementations, I would recommend looking at &lt;a href="https://github.com/terraform-google-modules/cloud-foundation-fabric"&gt;this&lt;/a&gt; repository, which helps users set up all these different networking models in GCP, including &lt;a href="https://github.com/terraform-google-modules/cloud-foundation-fabric/blob/master/networking/hub-and-spoke-peering"&gt;hub and spoke via peering&lt;/a&gt;, &lt;a href="https://github.com/terraform-google-modules/cloud-foundation-fabric/blob/master/networking/hub-and-spoke-vpn"&gt;hub and spoke via VPN&lt;/a&gt;, &lt;a href="https://github.com/terraform-google-modules/cloud-foundation-fabric/blob/master/networking/onprem-google-access-dns"&gt;DNS and Google Private Access for on-premises&lt;/a&gt;, &lt;a href="https://github.com/terraform-google-modules/cloud-foundation-fabric/blob/master/networking/shared-vpc-gke"&gt;Shared VPC with GKE support&lt;/a&gt;, &lt;a href="https://github.com/terraform-google-modules/cloud-foundation-fabric/blob/master/networking/ilb-next-hop"&gt;ILB as next hop&lt;/a&gt; and so on, using &lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And the interesting thing about networking in the cloud is that it need not be limited to the cloud provider within your region but can span multiple providers across multiple regions as needed, which is where projects like &lt;a href="https://github.com/kubernetes-sigs/kubefed"&gt;Kubefed&lt;/a&gt; and &lt;a href="https://crossplane.io/"&gt;Crossplane&lt;/a&gt; definitely do help.&lt;/p&gt;

&lt;p&gt;If you would like to explore some of the best practices for setting up VPCs, subnets and the networking as a whole, I would recommend going through &lt;a href="https://cloud.google.com/solutions/best-practices-vpc-design"&gt;this&lt;/a&gt; guide; the same concepts apply to whichever cloud provider you are onboard with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Masters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are using managed clusters like GKE, EKS or AKS, the masters are managed automatically by the cloud provider without users having direct access to them, lifting a lot of complexity away from you. In that case, you typically would not have to worry much about managing the masters, but you should still specify the kind of master setup you need (regional or zonal) depending on the fault tolerance and availability you require.&lt;/p&gt;

&lt;p&gt;If you are managing the masters yourself rather than having the cloud provider do it for you, you need to take care of many things: maintaining multiple masters as needed, backing up and encrypting the etcd store, setting up networking between the masters and the various nodes in the cluster, patching your nodes periodically with the latest OS versions, managing cluster upgrades to align with upstream Kubernetes releases, and so on. This is only recommended if you can afford a dedicated team that does just this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Site Reliability Engineering (SRE)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you maintain a complex infrastructure, it is very important to have the right observability stack in place so that you can catch errors even before your users notice them, predict possible changes, identify anomalies and drill down deep into where an issue exactly lies.&lt;/p&gt;

&lt;p&gt;Now, this requires agents which expose metrics specific to each tool or application so they can be collected for analysis (following either a push or a pull mechanism). And if you are using a service mesh with sidecars, the sidecars often come with metrics out of the box, without any custom instrumentation on your part.&lt;/p&gt;

&lt;p&gt;In such scenarios, a tool like &lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; can act as the time series database that collects all the metrics for you, along with something like &lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt; to expose metrics from the application and the various tools using inbuilt exporters, &lt;a href="https://github.com/prometheus/alertmanager"&gt;Alertmanager&lt;/a&gt; to send notifications and alerts to multiple channels, and &lt;a href="https://grafana.com/"&gt;Grafana&lt;/a&gt; as the dashboard to visualize everything in one place, giving users complete visibility of the infrastructure as a whole.&lt;/p&gt;
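&lt;p&gt;To make the pull model concrete, here is a minimal Prometheus configuration that scrapes Prometheus itself plus one application endpoint; the "my-app" target is a hypothetical service exposing metrics on port 8080:&lt;/p&gt;

```yaml
# Minimal prometheus.yml: scrape Prometheus itself and a hypothetical app
global:
  scrape_interval: 15s          # how often to pull metrics from each target
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: my-app            # hypothetical application exposing /metrics
    metrics_path: /metrics
    static_configs:
      - targets: ["my-app:8080"]
```

&lt;p&gt;In a real cluster you would usually replace static_configs with Kubernetes service discovery (kubernetes_sd_configs) so that new pods are scraped automatically.&lt;/p&gt;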

&lt;p&gt;In summary, this is what the observability stack involving Prometheus would look like (Source: &lt;a href="https://prometheus.io/docs/introduction/overview/"&gt;https://prometheus.io/docs/introduction/overview/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S_zNGYIt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/767o2aneuwjd5e50ajz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S_zNGYIt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/767o2aneuwjd5e50ajz7.png" alt="Prometheus Stack"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Operating complex systems like these also requires a log aggregation system so that all the logs can be streamed to a single place for easier debugging. This is where people tend to use the ELK or EFK stack, with &lt;a href="https://www.elastic.co/logstash"&gt;Logstash&lt;/a&gt; or &lt;a href="https://www.fluentd.org/"&gt;Fluentd&lt;/a&gt; doing the log aggregation and filtering for you based on your constraints. But there are new players in this space, like &lt;a href="https://grafana.com/oss/loki/"&gt;Loki&lt;/a&gt; and &lt;a href="https://grafana.com/docs/loki/latest/clients/promtail/"&gt;Promtail&lt;/a&gt;, which do the same thing but in a different way.&lt;/p&gt;
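&lt;p&gt;As a small illustration of the Loki/Promtail approach, this is a hedged sketch of a Promtail configuration that tails local log files and ships them to a Loki instance; the Loki URL and the label names are assumptions for this example:&lt;/p&gt;

```yaml
# Minimal Promtail config: tail /var/log and push entries to Loki
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml   # where Promtail records how far it has read
clients:
  - url: http://loki:3100/loki/api/v1/push   # assumed Loki endpoint
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*log   # glob of files to tail
```

&lt;p&gt;Unlike the ELK/EFK approach, Loki indexes only these labels rather than the full log text, which is what keeps it lightweight.&lt;/p&gt;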

&lt;p&gt;This is how log aggregation systems like Fluentd simplify our architecture (Source: &lt;a href="https://www.fluentd.org/architecture"&gt;https://www.fluentd.org/architecture&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cd2FhhfY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9zzxt4wsd7ns3fe2z0jz.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cd2FhhfY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9zzxt4wsd7ns3fe2z0jz.PNG" alt="FluentD architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what about tracing a request that spans multiple microservices and tools? This is where distributed tracing becomes very important, especially considering the complexity that microservices come with. This is an area where tools like &lt;a href="https://zipkin.io/"&gt;Zipkin&lt;/a&gt; and &lt;a href="https://www.jaegertracing.io/"&gt;Jaeger&lt;/a&gt; have been pioneers, with the recent entrant to this space being &lt;a href="https://grafana.com/oss/tempo/"&gt;Tempo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While log aggregation gives you information from various sources, it does not necessarily give you the context of a request, and this is where tracing really helps. But do remember that adding tracing to your stack adds a significant overhead to your requests, since trace contexts have to be propagated between services along with the requests.&lt;/p&gt;

&lt;p&gt;This is what a typical distributed tracing architecture looks like (Source: &lt;a href="https://www.jaegertracing.io/docs/1.21/architecture/"&gt;https://www.jaegertracing.io/docs/1.21/architecture/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NxGP_Yk3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g0ye2p9li38fy4fevulv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NxGP_Yk3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g0ye2p9li38fy4fevulv.png" alt="Jaeger Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But site reliability does not end with monitoring, visualization and alerting. You have to be ready to handle failures in any part of the system, with regular backups and failovers in place, so that data loss is either avoided entirely or minimized. This is where tools like &lt;a href="https://velero.io/"&gt;Velero&lt;/a&gt; play a major role.&lt;/p&gt;

&lt;p&gt;Velero helps you maintain periodic backups of various components in your cluster, including your workloads, storage and more, by leveraging the same Kubernetes constructs you already use. This is what Velero's architecture looks like (Source: &lt;a href="https://velero.io/docs/v1.5/how-velero-works/"&gt;https://velero.io/docs/v1.5/how-velero-works/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ff4L9Gb_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g6fvyjvawvrir1t6zsmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ff4L9Gb_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g6fvyjvawvrir1t6zsmc.png" alt="Velero Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, a backup controller periodically backs up your objects, pushing the backups to a specific destination at the frequency defined by the schedule you have set. This can be used for failovers and migrations, since almost all objects are backed up (while you still have control to back up just what you need).&lt;/p&gt;
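&lt;p&gt;Such a schedule is itself just a Kubernetes object. A hedged sketch of a Velero Schedule that backs up a hypothetical "prod" namespace every night might look like this:&lt;/p&gt;

```yaml
# Velero Schedule: daily backup of the (hypothetical) "prod" namespace
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-prod-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"       # cron expression: every day at 02:00
  template:
    includedNamespaces:
      - prod
    ttl: 720h0m0s             # retain each backup for 30 days
```

&lt;p&gt;The backup controller picks this up and pushes the resulting backups to the object storage location you configured when installing Velero.&lt;/p&gt;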

&lt;p&gt;Now, all these form some of the most important parts of the SRE stack, and while there is more to it, this should be a good start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While it is easy to start off with storage in Kubernetes using Persistent Volumes, Persistent Volume Claims and Storage Classes, it becomes difficult at scale: storage may need to be clustered as load increases, backed up, and synced when splitting workloads across multiple clusters or regions, making it a hard problem to solve.&lt;/p&gt;

&lt;p&gt;There are also a lot of different &lt;a href="https://kubernetes.io/docs/concepts/storage/storage-classes/#provisioner"&gt;storage provisioners&lt;/a&gt; and filesystems available, which can vary a lot between cloud providers. This calls for a standard like &lt;a href="https://kubernetes-csi.github.io/docs/introduction.html"&gt;CSI&lt;/a&gt;, which helps push most of the volume plugins out of tree, making them easier to maintain and evolve without the Kubernetes core being a bottleneck.&lt;/p&gt;
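&lt;p&gt;From the user's side, a CSI driver is consumed through a StorageClass. As a hedged sketch, here is a StorageClass using the GCE Persistent Disk CSI driver as the example provisioner (the driver name and parameters vary per cloud/driver), plus a claim that uses it:&lt;/p&gt;

```yaml
# StorageClass backed by a CSI driver (GCE PD CSI driver as the example)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: pd.csi.storage.gke.io   # CSI driver name differs per provider
parameters:
  type: pd-ssd
---
# A claim that dynamically provisions a 10Gi SSD-backed volume via the class
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi
```

&lt;p&gt;The point of CSI is that only the provisioner line and its parameters change between providers; the claim side stays identical.&lt;/p&gt;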

&lt;p&gt;This is what the CSI architecture typically looks like supporting various volume plugins and provisioners (Source: &lt;a href="https://kubernetes.io/blog/2018/08/02/dynamically-expand-volume-with-csi-and-kubernetes/"&gt;https://kubernetes.io/blog/2018/08/02/dynamically-expand-volume-with-csi-and-kubernetes/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--USUn0rLC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zxb1wn4svkveyapun4og.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--USUn0rLC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zxb1wn4svkveyapun4og.png" alt="CSI Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But how do we handle the clustering, scale and various other problems which come with storage?&lt;/p&gt;

&lt;p&gt;This is where a filesystem like &lt;a href="https://ceph.io/"&gt;Ceph&lt;/a&gt; has already proved itself, having been used in production by a lot of companies for a long time. But considering that it was not built with Kubernetes in mind and is very hard to deploy and manage, a project like &lt;a href="https://rook.io/"&gt;Rook&lt;/a&gt; can really help.&lt;/p&gt;

&lt;p&gt;While Rook is not coupled to Ceph, and supports other filesystems like EdgeFS, NFS, etc. as well, Rook with Ceph CSI is like a match made in heaven. This is what the architecture of Rook with Ceph looks like (Source: &lt;a href="https://rook.io/docs/rook/v1.5/ceph-storage.html"&gt;https://rook.io/docs/rook/v1.5/ceph-storage.html&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ORGAnyt_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pywc5gwaik9kh8spvm3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ORGAnyt_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pywc5gwaik9kh8spvm3x.png" alt="Rook Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, Rook takes up the responsibility of installing, configuring and managing Ceph in the Kubernetes cluster, and this becomes interesting since the storage is automatically distributed underneath as per your preferences. All of this happens without the app being exposed to any of the complexity that lies underneath.&lt;/p&gt;

&lt;p&gt;You still request a claim as you typically would; it is just that your request is served by Rook and Ceph rather than the cloud provider itself.&lt;/p&gt;
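&lt;p&gt;A trimmed sketch of what that looks like with Rook: a replicated Ceph block pool, and a StorageClass pointing at the Rook Ceph CSI driver so that ordinary PVCs are served from the pool. Note this is intentionally abbreviated; a real StorageClass for the Rook CSI driver also needs several csi.storage.k8s.io secret parameters, per the Rook docs:&lt;/p&gt;

```yaml
# A Ceph block pool managed by Rook: 3 replicas of each object
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  replicated:
    size: 3
---
# StorageClass so PVCs are provisioned from the Ceph pool (trimmed sketch)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
reclaimPolicy: Delete
```

&lt;p&gt;From here on, any PVC with storageClassName rook-ceph-block is backed by Ceph instead of the cloud provider's disks.&lt;/p&gt;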

&lt;p&gt;&lt;strong&gt;Image Registry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are onboard a cloud provider, there is a high chance that they already provide an image registry as a service (e.g. &lt;a href="https://cloud.google.com/container-registry"&gt;GCR&lt;/a&gt;, &lt;a href="https://aws.amazon.com/ecr/"&gt;ECR&lt;/a&gt;, &lt;a href="https://azure.microsoft.com/en-in/services/container-registry/"&gt;ACR&lt;/a&gt;, etc.), which takes away all the complexity from you. And if your cloud provider does not provide one, you can also go for third-party registries like &lt;a href="https://hub.docker.com/"&gt;Docker Hub&lt;/a&gt;, &lt;a href="https://quay.io/"&gt;Quay&lt;/a&gt;, etc.&lt;/p&gt;

&lt;p&gt;But what if you want to host your own registry?&lt;/p&gt;

&lt;p&gt;This may be needed if you want to deploy your registry on premise, want more control over the registry itself, or want to reduce the costs associated with operations like vulnerability scanning.&lt;/p&gt;

&lt;p&gt;If this is the case, then going for a private image registry like &lt;a href="https://goharbor.io/"&gt;Harbor&lt;/a&gt; might actually help. This is what the architecture of Harbor looks like (Source: &lt;a href="https://goharbor.io/docs/1.10/install-config/harbor-ha-helm/"&gt;https://goharbor.io/docs/1.10/install-config/harbor-ha-helm/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MizEEtWQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ttyl8bnyh36awixd013t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MizEEtWQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ttyl8bnyh36awixd013t.png" alt="Harbor Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Harbor is an &lt;a href="https://opencontainers.org/"&gt;OCI&lt;/a&gt;-compliant registry made up of various components, including &lt;a href="https://docs.docker.com/registry/"&gt;Docker Registry V2&lt;/a&gt;, the Harbor UI, &lt;a href="https://github.com/quay/clair"&gt;Clair&lt;/a&gt; and &lt;a href="https://github.com/theupdateframework/notary"&gt;Notary&lt;/a&gt;, backed by a cache like Redis and a database like Postgres.&lt;/p&gt;

&lt;p&gt;Ultimately, it provides a user interface where you can manage the user accounts that have access to the registry, push/pull images as you normally would, manage quotas, get notified of events with webhooks, do vulnerability scanning with the help of Clair, sign pushed images with the help of Notary, and handle operations like mirroring or replication of images across multiple registries behind the scenes. All of this makes Harbor a great fit for a private registry on Kubernetes, especially since it is a graduated CNCF project.&lt;/p&gt;
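&lt;p&gt;Harbor itself is commonly installed on Kubernetes via its Helm chart. A hedged sketch of a values.yaml fragment for that chart (the domain harbor.example.com is a placeholder):&lt;/p&gt;

```yaml
# Fragment of values.yaml for the Harbor Helm chart (domain is hypothetical)
expose:
  type: ingress                  # expose Harbor through an Ingress
  ingress:
    hosts:
      core: harbor.example.com
externalURL: https://harbor.example.com   # URL clients use for docker login/push
persistence:
  enabled: true                  # persist registry data, database and cache
```

&lt;p&gt;With a file like this, something along the lines of "helm install harbor harbor/harbor -n harbor -f values.yaml" brings up the whole stack; check the chart's documentation for the full set of options (TLS, storage backends, Clair/Notary toggles).&lt;/p&gt;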

&lt;p&gt;&lt;strong&gt;CI/CD Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes acts as a great platform for hosting all your workloads at any scale, but this also calls for a standard way of deploying them with a streamlined CI/CD workflow. This is where setting up a pipeline like this can really help (Source: &lt;a href="https://thenewstack.io/ci-cd-with-kubernetes-tools-and-practices/"&gt;https://thenewstack.io/ci-cd-with-kubernetes-tools-and-practices/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---t0IOd2C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6qnz0vy1b221wgac5uk8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---t0IOd2C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6qnz0vy1b221wgac5uk8.png" alt="CI/CD Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are using a third-party service like &lt;a href="https://www.travis-ci.com/"&gt;Travis CI&lt;/a&gt;, &lt;a href="https://circleci.com/"&gt;Circle CI&lt;/a&gt;, &lt;a href="https://docs.gitlab.com/ee/ci/"&gt;GitLab CI&lt;/a&gt; or &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt;, they come with their own CI runners which run on their own infrastructure, requiring you to just define the steps in the pipeline you want to build. This would typically involve building the image, scanning it for possible vulnerabilities, running the tests, pushing it to the registry and, in some cases, provisioning a preview environment for approvals.&lt;/p&gt;

&lt;p&gt;Now, while the steps typically remain the same if you are managing your own CI runners, you would need to set them up either within or outside your clusters, with appropriate permissions to push the assets to the registry. For an example of using Skaffold and GitLab with the Kubernetes executor, GCR as the registry and GKE as the deployment target, you can have a look at &lt;a href="https://skaffold.dev/docs/tutorials/ci_cd/"&gt;this&lt;/a&gt; tutorial.&lt;/p&gt;
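&lt;p&gt;The typical build/test/deploy steps described above can be sketched as a minimal .gitlab-ci.yml. This is a hedged outline, not a production pipeline: the deployment name "my-app" is hypothetical, the test job is a placeholder, and a real pipeline would add vulnerability scanning and runner credentials for the cluster:&lt;/p&gt;

```yaml
# Minimal .gitlab-ci.yml sketch of a build -> test -> deploy pipeline
stages:
  - build
  - test
  - deploy

build-image:
  stage: build
  image: docker:20.10
  services:
    - docker:20.10-dind          # docker-in-docker for building images
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

run-tests:
  stage: test
  image: alpine:3.13
  script:
    - echo "run your test suite here"   # placeholder for the real test command

deploy:
  stage: deploy
  image: bitnami/kubectl:1.20
  script:
    - kubectl set image deployment/my-app my-app="$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```

&lt;p&gt;The CI_REGISTRY_* and CI_COMMIT_SHORT_SHA variables are predefined by GitLab CI; with a different provider the steps stay the same, only the variable names change.&lt;/p&gt;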

&lt;p&gt;With this, we have gone over the architecture of various important parts of the infrastructure, taking different examples from Kubernetes and its ecosystem. As we have seen, there are various tools in the CNCF stack addressing different infrastructure problems, and they are best viewed as Lego blocks: each focuses on a specific problem at hand and acts as a black box, abstracting a lot of complexity underneath for you.&lt;/p&gt;

&lt;p&gt;This allows users to leverage Kubernetes in an incremental fashion rather than onboarding all at once, using just the tools they need from the entire stack depending on their use case. The ecosystem is still evolving, especially with a lot of standards now in place allowing users to adopt it all without any kind of vendor lock-in at any point.&lt;/p&gt;

&lt;p&gt;If you have any questions or are looking for help or consultancy, feel free to reach out to me &lt;a href="https://www.twitter.com/techahoy"&gt;@techahoy&lt;/a&gt; or via &lt;a href="https://www.linkedin.com/in/tvvignesh"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>architecture</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Launching my Patreon Page</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Sat, 16 Jan 2021 06:26:05 +0000</pubDate>
      <link>https://dev.to/tvvignesh/launching-my-patreon-page-4f4e</link>
      <guid>https://dev.to/tvvignesh/launching-my-patreon-page-4f4e</guid>
      <description>&lt;p&gt;Hi friends. As the title suggests, this is a blog post announcing the launch of my patreon page: &lt;a href="https://patreon.com/vignesh"&gt;https://patreon.com/vignesh&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the introductory video from me regarding the same:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/1CQss9SEaao"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;*Excuse me for repeatedly saying Patreon instead of Patron in the video 😂&lt;/p&gt;

&lt;p&gt;While I truly believe that knowledge should be free for all, I also do believe that it should be sustainable over a long period of time for the creators to keep making amazing content for the community.&lt;/p&gt;

&lt;p&gt;This is where a platform like Patreon or Open Collective can really help, as creators like me spend valuable time creating quality content for the whole community.&lt;/p&gt;

&lt;p&gt;You might have seen some posts from me on &lt;a href="https://dev.to/tvvignesh"&gt;Dev.to&lt;/a&gt; or &lt;a href="https://medium.com/@tvvignesh"&gt;Medium&lt;/a&gt; and some live streams from my &lt;a href="https://www.youtube.com/channel/UCLnTH4GOCB5fXWoOfgklQwA"&gt;YouTube channel&lt;/a&gt; as well.&lt;/p&gt;

&lt;p&gt;While I would love to create content for all, I would also like to have more motivation and incentive to do so over a long term and as you might already agree, money can be a pretty good motivator 😉.. &lt;/p&gt;

&lt;p&gt;Just kidding, but it's true.&lt;/p&gt;

&lt;p&gt;For those who cannot afford it, all the content will always be available free for everyone, but Patreon members will get access to the content 2 months in advance, along with a host of other benefits as outlined in &lt;a href="https://www.patreon.com/vignesh"&gt;our tiers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I don't want to monetize with ads or paywalls at this point since that might hinder your experience as a user, and hence I am sticking to Patreon.&lt;/p&gt;

&lt;p&gt;I will maintain a consistent schedule to release content. The frequency of the content I release will depend on the number of Patreon members I get. The more the members, the higher the frequency.&lt;/p&gt;

&lt;p&gt;You can expect the first video to be released on the first week of February 2021 or as soon as I get 50 patrons, whichever comes first 😉&lt;/p&gt;

&lt;p&gt;While I do all of this, I also plan to take up part-time/freelance consultancy/advisory roles as mentioned in the video though it has limited availability since I spend most of my time working on my startup, &lt;a href="https://timecampus.com/"&gt;Timecampus&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have any expectations about the content, do let me know (even if you don't want to opt in to Patreon). More than anything else, I would like to align with your interests and deliver what is of most value to you in the best way possible.&lt;/p&gt;

&lt;p&gt;If you have any suggestions or would just like to have a quick chat with me, feel free to reach out to me anytime.&lt;/p&gt;

&lt;p&gt;Looking forward to creating awesome content for you.&lt;/p&gt;

&lt;p&gt;And if you want to reach out, this is where you can find me:&lt;/p&gt;

&lt;p&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/tvvignesh"&gt;https://www.linkedin.com/in/tvvignesh&lt;/a&gt;&lt;br&gt;
Twitter: &lt;a href="http://twitter.com/techahoy"&gt;http://twitter.com/techahoy&lt;/a&gt;&lt;br&gt;
Medium: &lt;a href="https://medium.com/@tvvignesh"&gt;https://medium.com/@tvvignesh&lt;/a&gt;&lt;br&gt;
Github: &lt;a href="https://github.com/tvvignesh"&gt;https://github.com/tvvignesh&lt;/a&gt;&lt;br&gt;
Dev.to: &lt;a href="https://dev.to/tvvignesh"&gt;https://dev.to/tvvignesh&lt;/a&gt;&lt;br&gt;
Instagram: &lt;a href="https://instagram.com/tvvignesh"&gt;https://instagram.com/tvvignesh&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br&gt;
Vignesh T.V.&lt;/p&gt;

&lt;p&gt;PS: I am not sure if Dev.to has tags for these kinds of posts. So, I am instead tagging it with the kind of content I will be releasing on my page.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>startup</category>
      <category>webdev</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Infrastructure Engineering - Diving Deep</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Fri, 15 Jan 2021 04:03:29 +0000</pubDate>
      <link>https://dev.to/timecampus/infrastructure-engineering-diving-deep-2jdd</link>
      <guid>https://dev.to/timecampus/infrastructure-engineering-diving-deep-2jdd</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is a part of a series on Kubernetes and its ecosystem where we will dive deep into the infrastructure one piece at a time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So far we have gone through the various principles to keep in mind while working with Infrastructure as a whole and have also discussed the impact Kubernetes and its ecosystem has in our journey towards establishing a cloud native infrastructure.&lt;/p&gt;

&lt;p&gt;But considering there are a lot of tools and technologies in play to make this all happen, I am pretty sure you are left with a lot of unanswered questions. While we will dive deeper through the entire ecosystem in this series, I feel it is important to clear some of the clouded thoughts you may have. So, why not proceed with an FAQ (&lt;a href="https://medium.com/timecampus/graphql-diving-deep-c7c0abe608b2"&gt;similar to what I did for GraphQL&lt;/a&gt;)? That's what we will do here: I have put together a series of questions and answered them below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are new to Kubernetes or Containers, I would recommend you to start with any of these resources before jumping into this blog post:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/home/"&gt;Kubernetes Official Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.tutorialspoint.com/kubernetes/index.htm"&gt;Kubernetes Tutorials&lt;/a&gt; from Tutorialspoint&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ramitsurana.gitbook.io/awesome-kubernetes/docs"&gt;Kubernetes Awesome&lt;/a&gt; by Ramit Surana&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/kubernetes-engine/docs/tutorials"&gt;GKE&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html"&gt;EKS&lt;/a&gt;, &lt;a href="https://docs.microsoft.com/en-us/azure/aks/"&gt;AKS&lt;/a&gt;, &lt;a href="https://www.digitalocean.com/resources/kubernetes/"&gt;Digital Ocean&lt;/a&gt; tutorials&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.aquasec.com/cloud-native-academy/kubernetes-101/kubernetes-tutorials/"&gt;Tutorial collection&lt;/a&gt; aggregated by Aquasec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you are comfortable with the basics, I would highly recommend sticking to the official Kubernetes documentation, since that acts as the single source of truth for everything Kubernetes. But if you want to learn more about or explore the &lt;a href="https://www.cncf.io/projects/"&gt;various tools&lt;/a&gt; supporting Kubernetes, I would recommend going through their respective docs.&lt;/p&gt;

&lt;p&gt;And if you really want to hear all this from industry experts and learn more by going through their case studies, I would highly recommend having a look at the &lt;a href="https://www.youtube.com/c/cloudnativefdn/videos"&gt;CNCF YouTube Channel&lt;/a&gt;, which hosts tons of useful resources in relation to Kubernetes and its ecosystem (don't miss the Cloud Native/KubeCons).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Cloud Native?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our &lt;a href="https://medium.com/timecampus/infrastructure-engineering-the-kubernetes-way-35ed5b1f705e"&gt;last blog&lt;/a&gt; we did look at what Cloud Native is and what the stack looks like. The main reasons why you would want to go cloud native is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To achieve maximum scalability, but get there incrementally and do it either on demand or even automatically based on constraints&lt;/li&gt;
&lt;li&gt;Having systems which respond better to faults rather than trying to avoid them entirely, which is not possible once you scale a distributed system&lt;/li&gt;
&lt;li&gt;Avoid any sort of lock-in by adopting a vendor agnostic model leveraging standards, platforms like Kubernetes and other cloud native constructs available for you to use&lt;/li&gt;
&lt;li&gt;Accelerate the Inner Dev Loop and the complete cycle from application development to production by establishing standard automation and scalable CI/CD pipelines in place allowing for agile development and better release process and delivery of applications&lt;/li&gt;
&lt;li&gt;Have a high emphasis on the sanity and resilience of your application be it security, monitoring, logging, distributed tracing, backups/failovers and more by leveraging various tools and mechanisms already available as part of the cloud native stack.&lt;/li&gt;
&lt;li&gt;Handle different architectures be it private cloud, public cloud, on premise or even hybrid cloud without having to change too much in the application layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, this would typically mean that it is highly recommended to go Cloud Native for almost &lt;strong&gt;everything&lt;/strong&gt; irrespective of your use case. Your adoption level of various tools and technologies may vary depending on your use case, but the principles do remain almost always the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I use Docker compose. How do I transition to Kubernetes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are starting off and have a lot of compose files to convert, then I would recommend trying out a tool like &lt;a href="https://kompose.io/"&gt;Kompose&lt;/a&gt;, which will take in your compose files and generate the equivalent Kubernetes YAML files for you, simplifying your task from there.&lt;/p&gt;

&lt;p&gt;But this is recommended only when you are starting off as a beginner with Kubernetes. Once you start working closely with it, it is recommended to have your dev environment also be a Kubernetes cluster (be it something like Minikube, Kind, MicroK8s or K3s) or even a remote cluster (like GKE, AKS, EKS and so on). This is because you will get a consistent experience from development to production, and you will have to get comfortable with it sooner or later if you use Kubernetes.&lt;/p&gt;
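&lt;p&gt;To make the Kompose workflow concrete, here is a minimal docker-compose.yml of the kind Kompose can translate (the service and image are just an example):&lt;/p&gt;

```yaml
# Minimal docker-compose.yml that Kompose can convert to Kubernetes manifests
version: "3"
services:
  web:
    image: nginx:1.19
    ports:
      - "8080:80"     # host port 8080 -> container port 80
```

&lt;p&gt;Running "kompose convert -f docker-compose.yml" on a file like this generates a Deployment and a Service manifest for the web service, which you can then apply with kubectl and refine by hand.&lt;/p&gt;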

&lt;p&gt;&lt;strong&gt;How do I run a Kubernetes Cluster locally?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are a lot of options, as we mentioned, which help us run Kubernetes clusters locally. Some of the notable ones are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://minikube.sigs.k8s.io/docs/"&gt;Minikube&lt;/a&gt;: A one node K8 cluster running within a VM, maintained by Google Container Tools team. Can be pretty bulky if you are low on resources and you want to run multiple clusters on one machine and also take quite some time to start or stop the cluster. Has a high compatibility with upstream K8 versions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kind.sigs.k8s.io/docs/"&gt;Kind&lt;/a&gt;: Run Kubernetes clusters within Docker, a Kubernetes Sigs project, used as the tool to test Kubernetes itself. Starts and stops pretty quickly. Since there are different container runtimes like Docker and Podman, the behavior can be different in both, every node/control plane is hosted within its own container and leverages the docker networking for all the communication. Has a high compatibility with upstream K8 versions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://k3s.io/"&gt;K3s&lt;/a&gt;: A lightweight stripped down version of Kubernetes, maintained by Rancher Labs, only stable features are shipped and without many plugins leading to a very low binary size, supports auto deployment. Has a high compatibility with upstream K8 versions but you may want to watch out if you are using alpha features/plugins since you have to install them manually for you to use.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://microk8s.io/"&gt;MicroK8s&lt;/a&gt;: A lightweight Kubernetes version from Canonical, packaged as a snap (so, you don't need a VM again), better compatible with Ubuntu than other distributions and not supported in distributions without Snap support.&lt;/li&gt;
&lt;/ul&gt;
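&lt;p&gt;Taking Kind from the list above as an example, a multi-node local cluster is just a small config file:&lt;/p&gt;

```yaml
# kind-config.yaml: a local cluster with one control plane and two workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```

&lt;p&gt;Running "kind create cluster --config kind-config.yaml" then spins up each node as its own Docker container, which is what makes Kind so quick to start and tear down.&lt;/p&gt;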

&lt;p&gt;There are other options as well like &lt;a href="https://www.weave.works/oss/firekube/"&gt;Firekube&lt;/a&gt;, or you can even use &lt;a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/"&gt;Kubeadm&lt;/a&gt; directly if you want 🤔. Try everything and use what is best for you. &lt;/p&gt;

&lt;p&gt;But do remember that development and production K8 clusters can turn out to be pretty different. So, try testing in a staging environment before you ship something.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why do I need Helm?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you are starting off with Kubernetes, you may not need Helm; in fact, I would recommend not using it when you start. You can just go with good old YAML files and get things working.&lt;/p&gt;

&lt;p&gt;But Helm provides great value once you dive deeper: it helps you manage multiple environments with multiple configurations, a packaging/release process, and rollbacks or roll-forwards. It also acts as a package manager, letting you use a lot of the already available OSS tools out there by just changing the configuration to suit your needs.&lt;/p&gt;

&lt;p&gt;And since Tiller is no longer needed starting with Helm 3, all you need is the client to work with Helm.&lt;/p&gt;

&lt;p&gt;In summary, Helm can help you with templating, packaging/releasing/versioning, and also act as a package manager for Kubernetes.&lt;/p&gt;
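&lt;p&gt;As a minimal sketch of that workflow (the chart, release and values file names here are illustrative), scaffolding a chart, rendering it with environment-specific values, installing it and rolling back might look like this:&lt;/p&gt;

```shell
# Scaffold a starter chart (creates ./myapp with templates and values.yaml)
helm create myapp

# Render the manifests locally with staging overrides to review them
helm template myapp ./myapp --values values-staging.yaml

# Install the chart as a release, then roll back to revision 1 if needed
helm install myapp-staging ./myapp --values values-staging.yaml
helm rollback myapp-staging 1
```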

&lt;p&gt;&lt;strong&gt;How do I check-in secrets and credentials to my version control?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This has always been a major problem for developers and one of the main reasons behind a lot of attacks on systems like &lt;a href="https://www.securonix.com/securonix-threat-research-uber-hack-software-code-repository-vcs-leaked-credential-usage-detection/"&gt;this&lt;/a&gt;. It has led to a lot of concern about the way credentials are handled, while there is still a need to store them safely somewhere. &lt;/p&gt;

&lt;p&gt;This can be handled in multiple ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a credential manager like &lt;a href="https://www.vaultproject.io/"&gt;Vault&lt;/a&gt; which can help you manage secrets and sensitive data in one place&lt;/li&gt;
&lt;li&gt;Encrypt the confidential data/credentials with a Key Management Service (KMS) using a tool like &lt;a href="https://github.com/mozilla/sops"&gt;SOPS&lt;/a&gt; and check in only the encrypted credentials to version control. For very confidential credentials you can also use an HSM (Hardware Security Module), which typically provides the highest level of physical security&lt;/li&gt;
&lt;/ul&gt;
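&lt;p&gt;As a rough sketch of the SOPS approach (the KMS key ARN and file names below are placeholders), encrypting a file in place before committing it might look like this:&lt;/p&gt;

```shell
# Encrypt the file in place using a cloud KMS key (placeholder ARN)
sops --encrypt --in-place \
  --kms "arn:aws:kms:us-east-1:111122223333:key/placeholder" secrets.yaml

# The encrypted file is now safe to commit to version control
git add secrets.yaml
git commit -m "Add encrypted secrets"

# Anyone with access to the KMS key can decrypt it for local use
sops --decrypt secrets.yaml
```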

&lt;p&gt;While you do all this, it is important to have tools in place in case accidents do happen. For instance, a service like GitHub has the &lt;a href="https://docs.github.com/en/free-pro-team@latest/github/administering-a-repository/about-secret-scanning"&gt;ability to scan for secrets&lt;/a&gt; in repositories, and you may want to leverage it as well. And if the worst happens even after all this, revoke the compromised credentials immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I use a legacy stack. How do I make my application Cloud Native?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Making an application Cloud Native is a continuous process with opportunities always available for changes or improvements. The best way to start off is to take one component at a time and migrate rather than attempting an all-out migration which may not be feasible to start with. &lt;/p&gt;

&lt;p&gt;For instance, if you have a portion of the application which requires high scalability, try doing a lift and shift of that portion alone, drive a fraction of the traffic to the new deployment, and see whether the migration actually helps. This is where A/B testing or canary architectures with a service mesh can help.&lt;/p&gt;

&lt;p&gt;To keep your application available while you complete the migration, it is also recommended to keep the legacy architecture running in parallel with the cloud native implementation.&lt;/p&gt;

&lt;p&gt;In many cases, making an application Cloud Native might require application changes or in very rare cases, a rewrite as well. So, make sure you evaluate all your options and step in with a clear feasibility study.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I build a Highly Available Kubernetes cluster?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In some mission-critical cases, you might need to make your Kubernetes clusters highly available. But don't get too paranoid, because the more highly available your architecture is, the more complex it becomes. So, be thoughtful and assess whether you really need it before you start off.&lt;/p&gt;

&lt;p&gt;Making a cluster highly available involves things like running multiple replicas of the masters, splitting them across different zones/regions, with a syncing mechanism established between their respective etcd stores. It might also involve provisioning nodes across different regions so that a failure in one region does not affect traffic in the rest. You might want to check how to create an HA cluster with Kubeadm &lt;a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, if you want your application itself to be highly available, you may want to maintain multiple replicas of your pods, making sure they don't all end up on the same node by leveraging labels and pod affinity/anti-affinity.&lt;/p&gt;
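&lt;p&gt;As an illustration (the names and image are placeholders), a Deployment that asks the scheduler to keep its replicas on separate nodes could look like this:&lt;/p&gt;

```yaml
# A hypothetical Deployment running three replicas, with pod anti-affinity
# forcing each replica onto a different node (by hostname).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: myapp
              topologyKey: kubernetes.io/hostname
      containers:
        - name: myapp
          image: myapp:1.0.0
```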

&lt;p&gt;In addition, you may want to scale your application at the load balancer level using a software-based load balancer, leverage DNS to make sure you don't send all the traffic to the same place all the time, and serve static assets from a CDN, which can help even when a server or cluster fails.&lt;/p&gt;

&lt;p&gt;In summary, scalability and availability can and should be addressed at multiple layers, depending on where your bottlenecks or issues are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I collaborate with my team in the same Kubernetes cluster?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are a lot of ways to isolate workloads and still collaborate when working with your team on Kubernetes. You can use one cluster per developer, and some organizations do: there is nothing wrong with that, since it provides great isolation between team members' workloads, but it can turn out to be unmaintainable and difficult to manage over the long term (especially if a single ops team is handling the security patches, upgrades and auth management for all of them). To avoid such scenarios, some other ways to do this would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leverage Kubernetes Namespaces&lt;/strong&gt; - Allocate a namespace to each team member, or to a team as a whole, and run your workloads within it. While this can work very well for small teams or individuals, it becomes difficult to manage as the number of namespaces grows with the number of teams/members using the cluster, making it hard for the admin to set up RBAC rules, remove unused namespaces, scale the cluster, and so on. To simplify this, there are tools like &lt;a href="https://okteto.com/"&gt;Okteto&lt;/a&gt; which can keep you sane while you work.&lt;/li&gt;
&lt;/ul&gt;
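&lt;p&gt;A minimal sketch of this approach (the user and namespace names are illustrative): give each developer a namespace and bind the built-in edit ClusterRole to them within it:&lt;/p&gt;

```shell
# One namespace per developer
kubectl create namespace dev-alice

# Grant alice the built-in "edit" role, scoped to her namespace only
kubectl create rolebinding alice-edit \
  --clusterrole=edit --user=alice --namespace=dev-alice
```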

&lt;p&gt;Or, if you have access to a separate namespace all for yourself, you can use features like Swap Deployment and Proxy Deployment from &lt;a href="https://www.telepresence.io/"&gt;Telepresence&lt;/a&gt;, effectively allowing you to develop your service locally on your own system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leverage Same Namespace with Headers and Proxying&lt;/strong&gt; - While a single namespace can be shared by multiple developers, things can become tricky, especially if multiple developers are modifying the same service at the same time: one developer's changes will lead to unexpected results for another, since there is no isolation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where headers come to the rescue. Tools like &lt;a href="https://www.getambassador.io/docs/latest/topics/using/edgectl/"&gt;Service Preview&lt;/a&gt; or &lt;a href="https://code.visualstudio.com/docs/containers/bridge-to-kubernetes"&gt;Bridge to Kubernetes&lt;/a&gt; do exactly this: they leverage a sidecar like Envoy to route requests to a different instance of a service depending on the headers in the request. This is powerful because all you need is one namespace for everything, without blocking any developer from making the changes they want.&lt;/p&gt;

&lt;p&gt;Or if none of this works for you and you want complete control, you can spin up Kubernetes clusters locally and work with them using tools like &lt;a href="https://tilt.dev/"&gt;Tilt&lt;/a&gt; or &lt;a href="https://skaffold.dev/"&gt;Skaffold&lt;/a&gt;. The choice is again yours, depending on what you want to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I do Site Reliability Engineering (SRE) in my Kubernetes cluster?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unlike what most people think, Site Reliability Engineering (SRE) encompasses a wide array of activities: monitoring system health, logging events, handling scalability, responding to incidents/failures in a timely and organized manner, building complex distributed systems, and so on. There is a whole website where Google talks about SRE &lt;a href="https://sre.google/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And even though all of this is required, especially if your use case is SLA-critical, a lot of it is offloaded away from you if you use a managed Kubernetes cluster (like GKE, AKS, EKS and so on), leaving relatively little ground for you to cover. For instance, in the case of GKE, Google manages the masters; if you are subscribed to a release channel, they send you periodic updates with the latest features from both GKE and upstream; and they help you manage logging, monitoring and other operations with the Google Cloud Operations suite (formerly Stackdriver), giving you a quick start.&lt;/p&gt;

&lt;p&gt;In addition to this, there are a lot of amazing tools which help with different SRE problems in their own way. For instance, you can use &lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; to scrape and store time series metrics, &lt;a href="https://grafana.com/"&gt;Grafana&lt;/a&gt; to manage your dashboards, &lt;a href="https://www.fluentd.org/"&gt;FluentD&lt;/a&gt; or &lt;a href="https://grafana.com/oss/loki/"&gt;Loki&lt;/a&gt; for log aggregation and conditional filtering, &lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt; to instrument your application and expose metrics, &lt;a href="https://www.jaegertracing.io/"&gt;Jaeger&lt;/a&gt; for distributed tracing, and &lt;a href="https://velero.io/"&gt;Velero&lt;/a&gt; to manage backups and failovers. The list is long, with different tools catering to different problems in SRE.&lt;/p&gt;

&lt;p&gt;We will talk more about this in our next blog post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there a dashboard I can use to visualize and manage my clusters?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The great news is, there are quite a few. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some distributions of Kubernetes come with the &lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/"&gt;default Dashboard UI&lt;/a&gt;, giving you all the basic info and control you need over your cluster. But do note that it might not be enabled by default, for security reasons.&lt;/li&gt;
&lt;li&gt;You can also use the dashboard provided by your cloud provider. In the case of GKE, there is a great dashboard where you can drill down into every resource and manage it using the options provided, which is really convenient when you want to do something quickly.&lt;/li&gt;
&lt;li&gt;Or you can have an amazing tool like &lt;a href="https://octant.dev/"&gt;Octant&lt;/a&gt; take care of this for you. Think of it as a user interface for your kubectl client.&lt;/li&gt;
&lt;/ul&gt;
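&lt;p&gt;For instance, the default dashboard can be deployed from its official manifest and reached securely through kubectl's built-in proxy (the version in the URL below was current at the time of writing; check the docs for the latest):&lt;/p&gt;

```shell
# Deploy the dashboard from the official manifest
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0/aio/deploy/recommended.yaml

# Proxy the API server to localhost, then open the dashboard UI at
# http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/
kubectl proxy
```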

&lt;p&gt;There are even more dashboards like &lt;a href="https://www.weave.works/oss/scope/"&gt;Weavescope&lt;/a&gt; and even more tools like &lt;a href="https://octopus.com/blog/alternative-kubernetes-dashboards"&gt;these&lt;/a&gt;. Just go for what gives you more visibility and control over your cluster with great usability and you should be good to go.&lt;/p&gt;

&lt;p&gt;But for power-user operations, we would always recommend kubectl, since it is the one tool used by almost all clients out there to interact with the Kubernetes API server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need a Service Mesh?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Service mesh has garnered a lot of popularity these days, especially after leaders like &lt;a href="https://linkerd.io/"&gt;Linkerd&lt;/a&gt;, &lt;a href="https://istio.io/"&gt;Istio&lt;/a&gt; and &lt;a href="https://www.consul.io/"&gt;Consul&lt;/a&gt; demonstrated a different way to do networking, authorization, logging, instrumentation, A/B testing, mTLS and more with sidecars, without having to modify the application code.&lt;/p&gt;

&lt;p&gt;While a service mesh is really powerful, not every use case needs one, especially when you have very few services to manage. &lt;/p&gt;

&lt;p&gt;Adding a service mesh requires setting up both a control plane and a data plane, with proxies injected as sidecars to do the heavy lifting for you. While this might seem complex to start with, the benefits become apparent as the number of services you manage increases, and this is when you will reap its rewards.&lt;/p&gt;

&lt;p&gt;Also, once the &lt;a href="https://smi-spec.io/"&gt;SMI Spec&lt;/a&gt; is widely supported by all the mesh providers (there is good support already), it will bring a lot of standardization to the service mesh ecosystem, avoiding the need to be coupled to a specific implementation. Do note that a sidecar might not always work well with every tool you use alongside your application. For instance, adding a sidecar to your database or event queue may or may not work, depending on the protocol being used.&lt;/p&gt;

&lt;p&gt;But overall, it has a very promising future, especially when adopted incrementally. In summary: you don't need a mesh when you start, but it is great to have once you run a significant number of services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I manage Authentication &amp;amp; Authorization for the cluster and various services within?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are various ways to do authentication and authorization, and the right one can vary depending on your context.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you want to add authentication or authorization within your application/service, you can use any mechanism you see fit, including JWT with OAuth2, sessions, cookies or even basic auth. This requires writing logic within the service (with libraries to help you where needed), and while redoing this for every service that needs auth is definitely possible, it can be difficult.
To make this simpler, you can use a tool like &lt;a href="https://www.openpolicyagent.org/"&gt;OPA&lt;/a&gt; as an SDK, which can generalize a lot of this for you. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While OPA has native Go support, other languages are supported via &lt;a href="https://www.openpolicyagent.org/docs/latest/wasm/"&gt;WebAssembly&lt;/a&gt; (if you want to use OPA as an SDK).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The next way to add authentication/authorization is using sidecars, either as part of a service mesh or with OPA itself as a sidecar. This offloads authentication/authorization away from your application, leaving just the business logic within, and allows you to inject the sidecars wherever needed without having to worry too much about breaking your app. The sidecars can also handle things like mTLS, rate limiting and more, as needed by your application.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you would like to do cluster-level authorization to assign roles, policies and access controls, you can make use of either &lt;a href="https://github.com/open-policy-agent/gatekeeper"&gt;OPA Gatekeeper&lt;/a&gt; or rely on &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/rbac/"&gt;RBAC&lt;/a&gt; to get the job done for you.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
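&lt;p&gt;As a small example of the RBAC option (the user and namespace are placeholders), a Role granting read-only access to pods, bound to a single user, looks like this:&lt;/p&gt;

```yaml
# A Role that allows reading pods in the "default" namespace...
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
# ...and a RoleBinding granting that Role to the user "jane"
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: default
  name: read-pods
subjects:
  - kind: User
    name: jane
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```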

&lt;p&gt;&lt;strong&gt;How do I support hybrid cloud with the help of the Cloud Native Stack?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are on board with Kubernetes and the rest of the Cloud Native stack, then supporting hybrid cloud can be pretty easy, provided you don't use services/APIs specific to one cloud provider. Almost every big cloud provider out there offers managed Kubernetes as a service today, and if you insist, you can also spin up VMs, run your Kubernetes cluster within them, and manage it yourself.&lt;/p&gt;

&lt;p&gt;Projects like &lt;a href="https://github.com/kubernetes-sigs/kubefed"&gt;Kubefed&lt;/a&gt; and &lt;a href="https://crossplane.io/"&gt;Crossplane&lt;/a&gt; are especially useful here, since they help you manage and orchestrate clusters, and the requests you send, across different cloud providers, even across regions.&lt;/p&gt;

&lt;p&gt;While these are the best tools to manage these kinds of hybrid cloud scenarios, using a service mesh can also help if you have a multicluster architecture like &lt;a href="https://istio.io/latest/docs/setup/install/multicluster/"&gt;this&lt;/a&gt; or &lt;a href="https://linkerd.io/2/features/multicluster/"&gt;this&lt;/a&gt;, helping you communicate across cloud providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which container runtime should I use?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes supports multiple container runtimes thanks to its adoption of the pod as the basic unit of scheduling. While Docker has been one of those runtimes so far, &lt;a href="https://kubernetes.io/blog/2020/12/02/dont-panic-kubernetes-and-docker/"&gt;it has recently been deprecated&lt;/a&gt; in favor of runtimes that implement the CRI standard directly, removing the dockershim. The other recognized runtimes would be &lt;a href="https://containerd.io/"&gt;containerd&lt;/a&gt;, or even a low-level runtime like &lt;a href="https://github.com/opencontainers/runc"&gt;runc&lt;/a&gt;. You can read more about how they compare in &lt;a href="https://www.inovex.de/blog/containers-docker-containerd-nabla-kata-firecracker/"&gt;this&lt;/a&gt; post or &lt;a href="https://www.ianlewis.org/en/container-runtimes-part-1-introduction-container-r"&gt;this one&lt;/a&gt;. As they explain, today a call to the Docker engine results in a call to containerd, which in turn calls runc. The main difference lies in the level of abstraction each runtime provides, and at the lowest level of the hierarchy sits &lt;a href="https://linuxcontainers.org/lxc/introduction/"&gt;LXC&lt;/a&gt;, written in C, or &lt;a href="https://github.com/opencontainers/runc"&gt;runc&lt;/a&gt;, written in Go.&lt;/p&gt;

&lt;p&gt;While you can go for any runtime which supports your use case and is also supported by your cloud provider, a great start would be to build an OCI-compliant image so that you can use it across different runtimes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the differences between CRI, CSI, CNI, SMI ? Why do they matter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All of them are different standards meant to avoid possible complexities and vendor lock-in allowing for a great level of interoperability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/kubernetes/cri-api"&gt;CRI&lt;/a&gt; (Container Runtime Interface) is a standard which helps establish interoperability within multiple container runtimes like containerd and others&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/container-storage-interface/spec"&gt;CSI&lt;/a&gt; (Container Storage Interface) is a standard which helps establish interoperability between multiple storage providers avoiding the need to have in-tree plugins within the core. So, any storage provider who supports CSI can work with Kubernetes without any issues. You can find a complete list of providers supporting CSI &lt;a href="https://kubernetes-csi.github.io/docs/drivers.html#production-drivers"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/containernetworking/cni"&gt;CNI&lt;/a&gt; (Container Networking Interface) is a standard which helps establish interoperability between multiple networking solutions again avoiding the need to have in-tree plugins within the core and separating container networking and execution. There are a lot of &lt;a href="https://github.com/containernetworking/cni#3rd-party-plugins"&gt;plugins&lt;/a&gt; and &lt;a href="https://github.com/containernetworking/cni#container-runtimes"&gt;runtimes&lt;/a&gt; which support CNI today.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://smi-spec.io/"&gt;SMI&lt;/a&gt; (Service Mesh Interface) is a standard which helps establish interoperability between various service mesh solutions like Linkerd, Istio, Consul and &lt;a href="https://github.com/servicemeshinterface/smi-spec#ecosystem"&gt;more&lt;/a&gt;. A lot of things like traffic access control, metrics, specs, splitting, etc. are also to be standardized so that users do not have to get locked in to a specific provider.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How do I use Kubernetes at the Edge or on IOT Devices?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes has become so pervasive that it is now used in all kinds of environments, including fighter planes and Raspberry Pis. This calls for a different way of thinking: adding support for offline operation and control, providing a lightweight distribution that can run with restricted compute, running on different processor architectures, and so on.&lt;/p&gt;

&lt;p&gt;Use cases like these are made possible by projects like &lt;a href="https://kubeedge.io/en/"&gt;KubeEdge&lt;/a&gt;, &lt;a href="https://k3s.io/"&gt;K3s&lt;/a&gt; and &lt;a href="https://github.com/virtual-kubelet/virtual-kubelet"&gt;Virtual Kubelet&lt;/a&gt;. You can read more about how they power the edge with different architectures and compromises &lt;a href="https://containerjournal.com/topics/container-networking/powering-edge-with-kubernetes-a-primer/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I start with Infrastructure as Code?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Infrastructure as Code (IaC) is not to be confused with Configuration Management though a lot of tools do have overlapping functionality providing features from both worlds.&lt;/p&gt;

&lt;p&gt;There are a lot of tools which help you express your infrastructure as code, the most notable being the likes of &lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt;, &lt;a href="https://www.pulumi.com/"&gt;Pulumi&lt;/a&gt;, &lt;a href="https://www.ansible.com/"&gt;Ansible&lt;/a&gt; and &lt;a href="https://puppet.com/"&gt;Puppet&lt;/a&gt;, each of which works differently. For instance, Terraform is declarative and uses HCL (HashiCorp Configuration Language), while Pulumi leverages the power of regular programming languages to do its job.&lt;/p&gt;

&lt;p&gt;The best way to start with Infrastructure as Code is to adopt it incrementally, as you normally would for any migration (unless you are starting from scratch). There are also a lot of community resources that can help you in the process. For instance, if you use Terraform, the &lt;a href="https://registry.terraform.io/"&gt;Terraform Registry&lt;/a&gt; hosts a lot of &lt;a href="https://www.terraform.io/docs/modules/index.html"&gt;Terraform modules&lt;/a&gt; from the community, along with wide support for a lot of providers. Interestingly, it also allows you to manage resources like deployments, services and so on in your Kubernetes cluster, if you want to do it the Terraform way. So, the options are endless.&lt;/p&gt;
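&lt;p&gt;For instance, the everyday Terraform workflow boils down to three commands, and reviewing the plan before applying anything is what keeps you safe:&lt;/p&gt;

```shell
# Download providers and configure the (remote) state backend
terraform init

# Compute and review the changes before touching any real infrastructure
terraform plan -out=tfplan

# Apply exactly the plan that was reviewed
terraform apply tfplan
```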

&lt;p&gt;But it only makes sense to adopt Infrastructure as Code if you truly embrace GitOps: every proposed change to your infrastructure should be properly reviewed by the respective stakeholders before being applied, and you should avoid possible conflicts by using a locked state file checked in to a remote location like GCS or the equivalent service of your cloud provider.&lt;/p&gt;

&lt;p&gt;If you want to adopt DRY principles and keep your code maintainable, a project like &lt;a href="https://terragrunt.gruntwork.io/"&gt;Terragrunt&lt;/a&gt; can help you with this.&lt;/p&gt;

&lt;p&gt;As we just saw, there are loads of options to go for. Just make sure that you review your changes properly before applying them, since unreviewed changes can have disastrous effects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I do CI/CD and GitOps with Kubernetes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There have been a lot of amazing projects in this area of late, so much so that there is now a &lt;a href="https://cd.foundation/"&gt;separate foundation&lt;/a&gt; dedicated to it. While projects like Jenkins led the way before, the cloud native world has an interesting set of problems to solve, starting with making your CI/CD pipeline itself scalable and adherent to all the cloud native principles we discussed. This has led to the rise of projects like &lt;a href="https://tekton.dev/"&gt;Tekton&lt;/a&gt; (which was born out of &lt;a href="https://knative.dev/"&gt;Knative&lt;/a&gt;), &lt;a href="https://jenkins-x.io/"&gt;Jenkins X&lt;/a&gt; (which also uses Tekton), &lt;a href="https://spinnaker.io/"&gt;Spinnaker&lt;/a&gt;, &lt;a href="https://docs.gitlab.com/ee/ci/"&gt;GitLab CI&lt;/a&gt; with Kubernetes executors if you are on GitLab, and &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; if you are on GitHub, giving you a myriad of options like &lt;a href="https://fluxcd.io/"&gt;FluxCD&lt;/a&gt;, &lt;a href="https://argoproj.github.io/argo-cd/"&gt;ArgoCD&lt;/a&gt;, etc. to play with.&lt;/p&gt;

&lt;p&gt;When doing CI/CD on Kubernetes, you can either have the runner hosted by your service provider (e.g. GitHub/GitLab hosted runners) or host your own runner on the Kubernetes clusters you want. Either way, you get the power to scale the runners as much as you need and run multiple pipelines in parallel without blocking others.&lt;/p&gt;

&lt;p&gt;Your runner can take on the job of pulling your code or binaries from version control, building the image, pushing it to the registry, and finally deploying it to the target clusters after all the necessary tests and checks pass.&lt;/p&gt;
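&lt;p&gt;A hypothetical GitHub Actions pipeline along those lines (the registry, image and deployment names are placeholders, and the deploy step assumes the runner already has cluster credentials) might look like this:&lt;/p&gt;

```yaml
# Build an image on every push to main, push it, then roll the deployment
name: ci-cd
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build and push image
        run: |
          docker build -t registry.example.com/myapp:${GITHUB_SHA} .
          docker push registry.example.com/myapp:${GITHUB_SHA}
      - name: Deploy to cluster
        run: kubectl set image deployment/myapp myapp=registry.example.com/myapp:${GITHUB_SHA}
```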

&lt;p&gt;Otherwise, the way you would do CI/CD in a cloud native world is similar to that of any other CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I do serverless in Kubernetes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, you always have servers. Serverless, to me, is nothing but a set of abstractions which help you forget about the servers and all the complexity and scalability concerns behind them, allowing you to focus on just the business logic at hand. Kubernetes allows you to do this as well.&lt;/p&gt;

&lt;p&gt;You can opt for projects like &lt;a href="https://knative.dev/"&gt;Knative&lt;/a&gt;, &lt;a href="https://www.openfaas.com/"&gt;OpenFaaS&lt;/a&gt; or &lt;a href="https://kubeless.io/"&gt;Kubeless&lt;/a&gt;, all of which allow you to run serverless workloads on Kubernetes. If you are on a cloud provider like GKE, they also offer their own solutions like &lt;a href="https://cloud.google.com/run"&gt;Cloud Run&lt;/a&gt;, which comes close to hosting your own serverless platform. Ultimately, the containers spin up and down in your cluster depending on the compute they need.&lt;/p&gt;
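&lt;p&gt;As a sketch, a minimal Knative Service (the image here is a placeholder) is just a few lines of YAML; Knative scales the underlying pods up and down, including down to zero, based on incoming traffic:&lt;/p&gt;

```yaml
# A minimal Knative Service; Knative manages routing and autoscaling
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
        - image: registry.example.com/hello:latest
          env:
            - name: TARGET
              value: "world"
```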

&lt;p&gt;&lt;strong&gt;How do I choose my cloud provider?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a huge debate to have and can depend on both your use case and a lot of other factors.&lt;/p&gt;

&lt;p&gt;But there are few things to keep in mind while choosing a cloud provider.&lt;/p&gt;

&lt;p&gt;Make sure that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They provide all the basic services you need to help you with your use case&lt;/li&gt;
&lt;li&gt;They provide the services at an affordable cost, even as you scale up/down&lt;/li&gt;
&lt;li&gt;They have great documentation, support and developer relations team&lt;/li&gt;
&lt;li&gt;The services they provide don't lock you in to their platform&lt;/li&gt;
&lt;li&gt;They provide services in most of the regions you would want to serve your customers in&lt;/li&gt;
&lt;li&gt;They provide a great emphasis on security, performance and usability in all their offerings&lt;/li&gt;
&lt;li&gt;They offer an SLA that matches your needs and have a good track record of responding quickly to incidents&lt;/li&gt;
&lt;li&gt;They satisfy all your compliance requirements and also have the necessary certifications to prove the same &lt;/li&gt;
&lt;li&gt;They have a rapid innovation culture which can support you in your future ventures or as you scale up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These would just be few of the criteria to start with. But above all, make sure that you give it a try yourself before going for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where can I know more about the Cloud Native projects?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best place to check this all out would be here: &lt;a href="https://www.cncf.io/projects/"&gt;https://www.cncf.io/projects/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While this does not cover all the Cloud Native Projects (since it only hosts projects within the CNCF foundation), this would be a great start for you depending on the domain you want to explore more on.&lt;/p&gt;

&lt;p&gt;But if you want to explore other projects, you can also have a look at the Cloud Native Landscape here: &lt;a href="https://landscape.cncf.io/"&gt;https://landscape.cncf.io/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While there are a lot of projects in that list, do note that they are at different stages of maturity, ranging from Sandbox to Graduated. So, please be mindful of that when choosing something, since it might undergo significant changes over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are there some case studies which can actually help me in the implementation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, and there are lots. For a start, you can go through the Kubernetes specific case studies &lt;a href="https://kubernetes.io/case-studies/"&gt;here&lt;/a&gt; and case studies from other CNCF projects &lt;a href="https://www.cncf.io/case-studies/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to know more about how a specific organization uses Kubernetes, it's highly likely that it's already out on YouTube. Just give it a search and you will find tons of talks. Or if you want to explore more, all you have to do is head over to KubeCon + CloudNativeCon, where you will find a lot of speakers talking about their experience using Kubernetes and all the tools available out there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I contribute back to the Kubernetes and Cloud Native community?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every bit of contribution really matters. And there are a lot of ways in which you can help.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/contribute/"&gt;Contribute to the K8 Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kubernetes/community/tree/master/contributors/guide"&gt;Contribute to K8&lt;/a&gt; with bug fixes, enhancements, failing tests, feedback and so on&lt;/li&gt;
&lt;li&gt;Help the community by joining the various channels within the &lt;a href="https://slack.k8s.io/"&gt;Kubernetes Slack&lt;/a&gt; community&lt;/li&gt;
&lt;li&gt;Contribute to all the &lt;a href="https://www.cncf.io/projects/"&gt;CNCF projects&lt;/a&gt; or projects from the &lt;a href="https://landscape.cncf.io/"&gt;Cloud Native landscape&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Write blogs like these, host meetups, speak at conferences about your experience, and evangelize in the best way you can&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.cncf.io/about/join/"&gt;Join&lt;/a&gt; the CNCF foundation and support the projects directly/indirectly&lt;/li&gt;
&lt;li&gt;Found a problem not yet addressed by anyone in the community? Propose and develop your own solution and contribute it back&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a lot of ways, small and big, in which you can give back, and the size does not matter. Every contribution counts.&lt;/p&gt;

&lt;p&gt;Hope this was informative. Do you have a question I have not covered in this list, or are you looking for help or engineering advisory/consultancy? Let me know by reaching out to me &lt;a href="https://twitter.com/techahoy"&gt;@techahoy&lt;/a&gt;. I will be using this blog as a living document and will update it with more useful Q&amp;amp;As as I find time.&lt;/p&gt;

&lt;p&gt;If this helped, do share it with your friends, and hang around and follow us for more like this every week. See you all soon.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Infrastructure Engineering - The Kubernetes Way</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Fri, 01 Jan 2021 13:42:25 +0000</pubDate>
      <link>https://dev.to/timecampus/infrastructure-engineering-the-kubernetes-way-o8j</link>
      <guid>https://dev.to/timecampus/infrastructure-engineering-the-kubernetes-way-o8j</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is a part of a series on Kubernetes and its ecosystem where we will dive deep into the infrastructure one piece at a time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the previous post, we discussed "The First Principles of Infrastructure Engineering", setting the context for this series by laying out our vision and goals for incrementally scalable, adaptable cloud native infrastructure that caters to the most important demands of today. With that settled, let us start our journey in this direction and see how various tools and platforms make it all happen.&lt;/p&gt;

&lt;p&gt;While we will talk about infrastructure in general, the first parallel we can draw is Kubernetes and its ecosystem itself, given the broad spectrum of problems it addresses and the thriving community and enterprise backing around it. And that is what we are going to talk about in this blog post: how Kubernetes addresses the various pressing challenges in infrastructure as a whole.&lt;/p&gt;

&lt;p&gt;You may ask, is Kubernetes the only player which does this?&lt;/p&gt;

&lt;p&gt;Well, the answer is NO, and there are definitely other players like CoreOS and OpenStack. While you should choose what is best for you, there is a great future for Kubernetes and its ecosystem, especially because of the huge community backing, which is evident if you take a look at the thriving landscape &lt;a href="https://landscape.cncf.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;. And Kubernetes should be a great fit for most use cases. &lt;/p&gt;

&lt;p&gt;But this does not stop you from using them together. For instance, you can run &lt;a href="https://www.openstack.org/use-cases/containers/leveraging-containers-and-openstack/" rel="noopener noreferrer"&gt;Kubernetes on OpenStack&lt;/a&gt; or &lt;a href="https://coreos.com/openstack/" rel="noopener noreferrer"&gt;OpenStack on Kubernetes&lt;/a&gt; if you wish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does the Cloud Native Stack look like?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just getting your application running within a virtual machine on a cloud provider does not make you cloud native. As Pivotal put it, &lt;em&gt;cloud native is an approach to building and running applications that fully exploits the advantages of the cloud computing model&lt;/em&gt;, which means we need a completely different thought process in how we look at infrastructure.&lt;/p&gt;

&lt;p&gt;These are some of the important parts of the cloud native stack, as described by Janakiram MSV in The New Stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwcmlq3j3nlv1m9yzrlrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwcmlq3j3nlv1m9yzrlrj.png" alt="The Cloud Native Stack"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And each block you see in this image is a separate problem to be solved when working with applications at any scale. Let us see how Kubernetes and the various tools around it take up this difficult job, abstracting away many of the complex details involved in onboarding and maintaining infrastructure.&lt;/p&gt;

&lt;p&gt;(You can go through the book from The New Stack on The State of Kubernetes and its ecosystem here: &lt;a href="https://thenewstack.io/ebooks/kubernetes/state-of-kubernetes-ecosystem-second-edition-2020/" rel="noopener noreferrer"&gt;https://thenewstack.io/ebooks/kubernetes/state-of-kubernetes-ecosystem-second-edition-2020/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every application requires some degree of computing power depending on its business logic and demand, and the infrastructure should allow compute to be scaled up or down incrementally as requirements change. &lt;/p&gt;

&lt;p&gt;Kubernetes (essentially a project conceptualized from &lt;a href="https://kubernetes.io/blog/2015/04/borg-predecessor-to-kubernetes/" rel="noopener noreferrer"&gt;Borg&lt;/a&gt; at Google) allows you to manage compute efficiently and scale it up or down as needed. &lt;/p&gt;

&lt;p&gt;The compute available to an application depends on various factors: the type of node the containers run on (the number of CPU cores available, the processors attached, the RAM available and, in compute-intensive cases, accessories like GPUs), and the constraints you set manually, including &lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/" rel="noopener noreferrer"&gt;cpu and memory limits&lt;/a&gt;, pod affinity/anti-affinity and so on.&lt;/p&gt;
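&lt;p&gt;As a minimal sketch (the names, image and values below are illustrative, not from this post), requests and limits are declared per container in the pod spec:&lt;/p&gt;

```yaml
# Illustrative pod spec: the scheduler uses "requests" for placement,
# while "limits" cap what the container may consume at runtime.
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: nginx:1.21   # hypothetical image
      resources:
        requests:
          cpu: "250m"     # a quarter of a CPU core
          memory: "128Mi"
        limits:
          cpu: "500m"
          memory: "256Mi"
```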

&lt;p&gt;And the best part is that all this compute is shared within the cluster or a specific namespace, allowing you to manage your resources effectively. If you are unsure how your computing needs will change, all you need to do is set up an autoscaler (&lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" rel="noopener noreferrer"&gt;HPA&lt;/a&gt;, &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler" rel="noopener noreferrer"&gt;VPA&lt;/a&gt;, &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;CA&lt;/a&gt;) that will scale your compute up or down within the constraints you set.&lt;/p&gt;
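&lt;p&gt;For example, a horizontal autoscaler can be declared as a manifest of its own (a sketch using the &lt;code&gt;autoscaling/v2beta2&lt;/code&gt; API; the Deployment name and targets are assumptions):&lt;/p&gt;

```yaml
# Illustrative HPA: scales the "web" Deployment between 2 and 10
# replicas, aiming for 80% average CPU utilization across pods.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```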

&lt;p&gt;Now, what if you want to do serverless? It is easy, especially within the Kubernetes ecosystem, since projects like &lt;a href="https://knative.dev/" rel="noopener noreferrer"&gt;Knative&lt;/a&gt;, &lt;a href="https://www.openfaas.com/" rel="noopener noreferrer"&gt;OpenFaaS&lt;/a&gt;, etc. do exactly this (ultimately, serverless is nothing but abstracting infrastructure away from developers and offering elasticity in compute as needed).&lt;/p&gt;

&lt;p&gt;Or, if you maintain legacy infrastructure and cannot afford to run all your compute in containers, you can go for a project like &lt;a href="https://kubevirt.io/" rel="noopener noreferrer"&gt;KubeVirt&lt;/a&gt;, which allows exactly that: running containers and virtual machines side by side in your cluster.&lt;/p&gt;

&lt;p&gt;Having all these choices makes Kubernetes a great choice for managing compute at any scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Managing state is difficult for a reason. Managing your storage improperly may lead to bottlenecks as your application scales: every disk is limited by its IOPS (Input/Output Operations Per Second), and at scale the storage layer can cause quite a few problems and often becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;This might require clustering your storage, possibly with multiple replicas, and on top of that you would also need to maintain periodic backups and handle seamless failovers. While Kubernetes does have its own way of managing storage using Persistent Volumes, Claims and Storage Classes, it does not handle every storage problem itself and instead leaves much of it to the storage provider.&lt;/p&gt;
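&lt;p&gt;The developer-facing side of this is usually just a claim. A minimal sketch (the claim name and the "standard" StorageClass name are assumptions; the class's provisioner creates the underlying volume on demand):&lt;/p&gt;

```yaml
# Illustrative PersistentVolumeClaim: requests 10Gi of storage from
# the "standard" StorageClass, mountable by a single node at a time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 10Gi
```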

&lt;p&gt;It has been possible to &lt;a href="https://kubernetes.io/docs/concepts/storage/storage-classes/#provisioner" rel="noopener noreferrer"&gt;plug many storage provisioners into Kubernetes&lt;/a&gt; and add &lt;a href="https://kubernetes.io/docs/concepts/storage/volumes/" rel="noopener noreferrer"&gt;many volume plugins&lt;/a&gt;, but the future aligns with &lt;a href="https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/" rel="noopener noreferrer"&gt;CSI&lt;/a&gt; (the Container Storage Interface), which promises compatibility with any storage provider of your choosing.&lt;/p&gt;

&lt;p&gt;For example, if you would like to bring a file system like Ceph onto Kubernetes, there is a &lt;a href="https://github.com/ceph/ceph-csi" rel="noopener noreferrer"&gt;Ceph-CSI driver&lt;/a&gt;, or if you would like to reduce the friction of maintaining Ceph, you can go for something like &lt;a href="https://rook.io/docs/rook/v1.5/ceph-csi-drivers.html" rel="noopener noreferrer"&gt;Rook with Ceph CSI&lt;/a&gt;. There are many players in this space, each providing a different set of options including File, Block, NFS and more, and you can read about how they perform and differ &lt;a href="https://vitobotta.com/2019/08/06/kubernetes-storage-openebs-rook-longhorn-storageos-robin-portworx/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you would like a better idea of the various storage options available, I would recommend watching this talk by Saad Ali from Google:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/uSxlgK1bCuA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Networking is critical for any application, since it is essentially the data path for all your requests, responses and internal communications. This becomes even more important in the Kubernetes world, where your cluster is composed of various pods across one or more namespaces, each fulfilling specific responsibilities, and the ability for them to communicate with each other is critical to the application's working.&lt;/p&gt;

&lt;p&gt;This is made possible by the inbuilt service discovery mechanisms in Kubernetes like &lt;a href="https://github.com/coredns/coredns" rel="noopener noreferrer"&gt;CoreDNS&lt;/a&gt; (kube-dns being its predecessor) with some help from &lt;a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/" rel="noopener noreferrer"&gt;kube-proxy&lt;/a&gt;, and by basic Kubernetes constructs like Services (of &lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types" rel="noopener noreferrer"&gt;various types&lt;/a&gt;, including ClusterIP, NodePort, LoadBalancer and more) and Ingress. &lt;/p&gt;

&lt;p&gt;Having constructs like these makes it much easier to establish communication via labels rather than hardcoding IP addresses or doing complex routing, all of which is handled by the DNS server given a proper &lt;a href="https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-hostname-and-subdomain-fields" rel="noopener noreferrer"&gt;FQDN&lt;/a&gt; (Fully Qualified Domain Name).&lt;/p&gt;
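&lt;p&gt;To make the label-based discovery concrete, here is a sketch (the service name, namespace and ports are illustrative): a Service selecting pods by label, which cluster DNS then exposes under a predictable FQDN.&lt;/p&gt;

```yaml
# Illustrative Service: pods labeled app=backend in the "prod"
# namespace become reachable inside the cluster at the FQDN
# backend.prod.svc.cluster.local, regardless of pod IPs.
apiVersion: v1
kind: Service
metadata:
  name: backend
  namespace: prod
spec:
  selector:
    app: backend
  ports:
    - port: 80         # port exposed by the Service
      targetPort: 8080 # port the container listens on
```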

&lt;p&gt;Networking becomes even more interesting with the introduction of &lt;a href="https://github.com/containernetworking/plugins" rel="noopener noreferrer"&gt;CNI&lt;/a&gt; (the Container Network Interface), which has become the standard for networking abstractions (similar to what CSI does for storage). It lets users pick any CNI plugin underneath from the large number of &lt;a href="https://github.com/containernetworking/plugins" rel="noopener noreferrer"&gt;in-tree&lt;/a&gt; and third-party plugins available, like &lt;a href="https://www.projectcalico.org/" rel="noopener noreferrer"&gt;Calico&lt;/a&gt;, &lt;a href="https://www.weave.works/docs/net/latest/overview/" rel="noopener noreferrer"&gt;Weave Net&lt;/a&gt;, &lt;a href="https://coreos.com/flannel/docs/latest/" rel="noopener noreferrer"&gt;Flannel&lt;/a&gt;, etc., and you can look at a comparison of some of them &lt;a href="https://rancher.com/blog/2019/2019-03-21-comparing-kubernetes-cni-providers-flannel-calico-canal-and-weave/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;CNI even empowers service mesh implementations. For instance, Linkerd has &lt;a href="https://linkerd.io/2/features/cni/" rel="noopener noreferrer"&gt;its own CNI plugin&lt;/a&gt; which manages the iptables rules for every pod in the mesh, and &lt;a href="https://istio.io/latest/docs/setup/additional-setup/cni/" rel="noopener noreferrer"&gt;Istio has something similar&lt;/a&gt; too.&lt;/p&gt;

&lt;p&gt;Having all of these as Kubernetes constructs also allows us to moderate traffic using &lt;a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/" rel="noopener noreferrer"&gt;Network Policies&lt;/a&gt;, and the possibilities are endless.&lt;/p&gt;
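&lt;p&gt;As a sketch of what such a policy looks like (the app labels and port are hypothetical), this restricts ingress to a set of pods based on the labels of the caller:&lt;/p&gt;

```yaml
# Illustrative NetworkPolicy: only pods labeled app=frontend may reach
# app=backend pods on TCP 8080; all other ingress to them is denied
# (assuming the cluster's CNI plugin enforces NetworkPolicy).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```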

&lt;p&gt;All of this makes Kubernetes a great fit for networking as well. And if you would like to know more about the networking stack, I would recommend you to watch this talk by Bowei Du and Tim Hockin from Google: &lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/tq9ng_Nz9j8"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Or this talk from Michael Rubin from Google: &lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/0Omvgd7Hg1I"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container-Optimized OS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With more and more virtualization and isolation in place, it makes sense to trim down the operating system, kernel and packages so that resources are used efficiently and there are fewer bugs to fix or patch. This is why container-optimized operating systems came about. For instance, there are distributions like &lt;a href="https://www.flatcar-linux.org/" rel="noopener noreferrer"&gt;Flatcar&lt;/a&gt; from Kinvolk and &lt;a href="https://rancher.com/docs/os/v1.x/en/" rel="noopener noreferrer"&gt;RancherOS&lt;/a&gt; from Rancher, and cloud providers even maintain their own versions, like &lt;a href="https://cloud.google.com/container-optimized-os" rel="noopener noreferrer"&gt;this&lt;/a&gt;, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2020/03/announcing-bottlerocket-a-new-open-source-linux-based-operating-system-optimized-to-run-containers/" rel="noopener noreferrer"&gt;this&lt;/a&gt; and &lt;a href="https://github.com/Azure/azure-container-networking/blob/master/docs/cni.md" rel="noopener noreferrer"&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you would like to know why a container-optimized OS is really needed, I would recommend watching this talk: &lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/XT8Lt4JJEQc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container Runtime&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Docker started out as a pioneer for containers (even though cgroups and containers were used in production before Docker), and so a lot of people adopted Docker as their container runtime. But Kubernetes looks at things differently, with the Pod as its basic unit. While Kubernetes initially shipped with support for Docker via a shim and maintained it for years, it recently announced that it is &lt;a href="https://kubernetes.io/blog/2020/12/02/dont-panic-kubernetes-and-docker/" rel="noopener noreferrer"&gt;deprecating Docker as the container runtime&lt;/a&gt; in favor of runtimes implementing &lt;a href="https://kubernetes.io/blog/2016/12/container-runtime-interface-cri-in-kubernetes/" rel="noopener noreferrer"&gt;CRI&lt;/a&gt; (the Container Runtime Interface), a standard that allows Kubernetes to work with runtimes like &lt;a href="https://containerd.io/" rel="noopener noreferrer"&gt;containerd&lt;/a&gt; and &lt;a href="https://cri-o.io/" rel="noopener noreferrer"&gt;CRI-O&lt;/a&gt;, both of which build on lower-level runtimes like &lt;a href="https://github.com/opencontainers/runc" rel="noopener noreferrer"&gt;runc&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you would like a look at how the runtimes differ, &lt;a href="https://www.inovex.de/blog/containers-docker-containerd-nabla-kata-firecracker/" rel="noopener noreferrer"&gt;this&lt;/a&gt; should give you a better idea, and if you would like to explore more, you can also watch this talk by Phil Estes from IBM: &lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/RyXL1zOa8Bw"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service Mesh&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A service mesh is meant to provide a platform to incrementally scale networking without chaos, maintain control over the communication between services, and offload operations like authentication, authorization, encryption, tracing, canary testing and more away from the application, reducing complexity as your application scales. But remember: a service mesh might be overkill for small projects and those that don't use microservices, though it can definitely pay off as you scale your application up.&lt;/p&gt;

&lt;p&gt;Tools like Istio, Linkerd and Consul are known for their great service mesh offerings, each running a sidecar alongside the actual container. While setting up a service mesh was initially complicated, things are gradually coming to a point where CLIs and charts can set up everything for you in one shot.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://smi-spec.io/" rel="noopener noreferrer"&gt;Service Mesh Interface&lt;/a&gt; (SMI) also paves way to a lot of future developments in this space bringing in a lot of standardization and interoperability without having to change the way your application works. If you would like to know more about what service mesh is and whether you really need a service mesh, you can go through &lt;a href="https://buoyant.io/service-mesh-manifesto/" rel="noopener noreferrer"&gt;this&lt;/a&gt; blog post from William Morgan, creator of Linkerd or you can also watch this talk by Brendan Burns from Microsoft: &lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/izVWk7rYqWI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security &amp;amp; Hardening&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security, as people rightly say, must be a first-class citizen in everything we do. While Kubernetes is intentionally not secure by default, it provides many options and ways to secure the infrastructure and all the workloads you run on top. This can include anything from controlling access to your cluster or a namespace, controlling inter-pod traffic, preventing containers from running as root, scanning images for vulnerabilities and more, and each needs a different approach on the way to the goal of Zero Trust security (e.g. &lt;a href="https://cloud.google.com/beyondcorp" rel="noopener noreferrer"&gt;BeyondCorp&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;There are many steps to securing Kubernetes. For instance, you can use &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/rbac/" rel="noopener noreferrer"&gt;RBAC&lt;/a&gt; for authorization, encrypt the etcd store, scan the image registry with something like &lt;a href="https://github.com/quay/clair" rel="noopener noreferrer"&gt;Clair&lt;/a&gt;, add sidecars or libraries with &lt;a href="https://www.openpolicyagent.org/" rel="noopener noreferrer"&gt;Open Policy Agent&lt;/a&gt; (OPA) to define policies as needed, set appropriate &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/security-context/" rel="noopener noreferrer"&gt;Security Contexts&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/concepts/policy/pod-security-policy/" rel="noopener noreferrer"&gt;Pod Security Policies&lt;/a&gt; to control the permissions your containers have, configure &lt;a href="https://linkerd.io/2/features/automatic-mtls/" rel="noopener noreferrer"&gt;mTLS&lt;/a&gt; to encrypt traffic within your mesh, and a lot more, each of which brings you one step closer on your journey to Zero Trust security.&lt;/p&gt;
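&lt;p&gt;To make the RBAC piece concrete, a minimal sketch (the namespace, role name and user "jane" are all hypothetical): a Role granting read-only access to pods, bound to a user.&lt;/p&gt;

```yaml
# Illustrative RBAC: "jane" may get/list/watch pods in the "dev"
# namespace and nothing else.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
  - apiGroups: [""]        # "" refers to the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: dev
subjects:
  - kind: User
    name: jane             # hypothetical user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```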

&lt;p&gt;&lt;strong&gt;Container Orchestration Engine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have the major parts of the underlying infrastructure in place and all the workloads we need to run on top, it is important to have an orchestration layer to manage all of it. This is where a project like Kubernetes comes in. But as we discussed already, Kubernetes is not the only option out there (though it is definitely the leader in this area). There are other options available like &lt;a href="https://www.okd.io/" rel="noopener noreferrer"&gt;OKD&lt;/a&gt;, which powers OpenShift, &lt;a href="https://github.com/rancher/rke" rel="noopener noreferrer"&gt;Rancher Kubernetes Engine&lt;/a&gt; (RKE) and more.&lt;/p&gt;

&lt;p&gt;And even if you go for Kubernetes, you will find various distributions depending on the cloud provider or environment you would like to run your clusters in. For instance, you could use &lt;a href="https://kubeedge.io/" rel="noopener noreferrer"&gt;KubeEdge&lt;/a&gt; if you work with IoT devices, allowing you to run a cluster offline, optimized for smaller hardware, with support for different processor architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you have your applications ready and have checked the latest code into version control, the next important step is to produce the artifacts you need, push the relevant images to the registry and then pull them to run in the cluster. This can be preceded by a lot of steps including scanning for vulnerabilities, running automated tests and so on, all of which are typically automated in a DevSecOps flow.&lt;/p&gt;

&lt;p&gt;Some of the pioneers in this area include &lt;a href="https://docs.gitlab.com/ee/ci/" rel="noopener noreferrer"&gt;Gitlab CI&lt;/a&gt;, &lt;a href="https://jenkins-x.io/" rel="noopener noreferrer"&gt;Jenkins X&lt;/a&gt;, &lt;a href="https://github.com/features/actions" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt; and more, which can be combined with tools like &lt;a href="https://skaffold.dev/" rel="noopener noreferrer"&gt;Skaffold&lt;/a&gt;, &lt;a href="https://github.com/containers/skopeo" rel="noopener noreferrer"&gt;Skopeo&lt;/a&gt; and the like to build a complete pipeline for managing builds. &lt;/p&gt;

&lt;p&gt;There are a lot of tools you can use to build OCI-compatible images, including container builders like Docker, &lt;a href="https://github.com/moby/buildkit" rel="noopener noreferrer"&gt;BuildKit&lt;/a&gt; (again from Docker/Moby), &lt;a href="https://buildah.io/" rel="noopener noreferrer"&gt;Buildah&lt;/a&gt; and more, as shared &lt;a href="https://blog.alexellis.io/building-containers-without-docker/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. It all boils down to what you would like to optimize for. If you would like to know how they compare, have a look at the presentation by Akihiro Suda from NTT &lt;a href="https://static.sched.com/hosted_files/kccnceu19/12/Building%20images%20%20efficiently%20and%20securely%20on%20Kubernetes%20with%20BuildKit.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt; on the various builders and how they differ.&lt;/p&gt;

&lt;p&gt;If you are interested in knowing more about setting up your CI/CD pipelines, I wrote a small tutorial on configuring Skaffold with Gitlab CI/CD, connecting to a GKE cluster and pushing/pulling images to/from a GCR registry, &lt;a href="https://skaffold.dev/docs/tutorials/ci_cd/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Release Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the application has been tested, built, scanned and pushed, the final step in the pipeline is to actually release the application to multiple environments as needed. This is where release management plays a major role. Irrespective of the tool used, the final asset applied to Kubernetes is nothing but a YAML or JSON manifest describing the various resources.&lt;/p&gt;

&lt;p&gt;This is the case even if a package manager like &lt;a href="https://helm.sh/" rel="noopener noreferrer"&gt;Helm&lt;/a&gt; is used, since Helm does nothing but template the YAML files with values supplied by users; the resulting charts can later be pushed to an OCI-compliant registry for versioning and sharing as releases. The &lt;a href="https://cd.foundation/" rel="noopener noreferrer"&gt;CD Foundation&lt;/a&gt; also hosts a number of projects under its umbrella meant to address exactly this problem. &lt;/p&gt;
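&lt;p&gt;A small sketch of the templating idea (the chart layout and value names here are illustrative, in the style of the default chart scaffold): a Deployment fragment whose placeholders are filled in from &lt;code&gt;values.yaml&lt;/code&gt; at render time.&lt;/p&gt;

```yaml
# Illustrative Helm template fragment (e.g. templates/deployment.yaml):
# {{ .Release.Name }} and {{ .Values.* }} are substituted when the
# chart is rendered, producing plain Kubernetes YAML.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}-web
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-web
    spec:
      containers:
        - name: web
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

&lt;p&gt;Running &lt;code&gt;helm template&lt;/code&gt; against such a chart shows the rendered manifests without installing anything, which is a handy way to see exactly what will be applied.&lt;/p&gt;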

&lt;p&gt;If you would like to go through Continuous delivery with Kubernetes in detail, I would recommend going through &lt;a href="https://cloud.google.com/solutions/addressing-continuous-delivery-challenges-in-a-kubernetes-world" rel="noopener noreferrer"&gt;this&lt;/a&gt; from Google.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container Registry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we discussed in Build Management, the artifacts generated (be it container images, Helm charts, etc.) are typically pushed to an OCI-compliant registry, which can then serve as the single source of truth when deploying to Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;While it all started with &lt;a href="https://hub.docker.com/" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt; leading the way, there are other options like &lt;a href="https://quay.io/" rel="noopener noreferrer"&gt;Quay&lt;/a&gt;, &lt;a href="https://cloud.google.com/container-registry" rel="noopener noreferrer"&gt;GCR&lt;/a&gt;, &lt;a href="https://aws.amazon.com/ecr/" rel="noopener noreferrer"&gt;ECR&lt;/a&gt; and &lt;a href="https://azure.microsoft.com/en-in/services/container-registry/" rel="noopener noreferrer"&gt;ACR&lt;/a&gt; providing an image registry as a SaaS offering, and if those don't work for you, you can even host your own private registry using something like &lt;a href="https://goharbor.io/" rel="noopener noreferrer"&gt;Harbor&lt;/a&gt;, making the possibilities endless.&lt;/p&gt;

&lt;p&gt;These registries often come with features like user account management and authorization allowing you to share the same registry with multiple users and clients as needed with appropriate tokens or credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have a huge stack in place, the next important thing to keep in mind is that all of it has to be monitored proactively for anomalous behavior indicating possible issues, failures or changes in the behavior of the system. This becomes really important at scale, because it is very difficult to drill down into a problem in a microservices architecture, with multiple tools and services working together, unless an observability stack is in place.&lt;/p&gt;

&lt;p&gt;This is one of the places where the Kubernetes ecosystem really shines. You can go for an &lt;a href="https://www.elastic.co/what-is/elk-stack" rel="noopener noreferrer"&gt;ELK&lt;/a&gt; (Elastic, Logstash, Kibana) stack or an &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-set-up-an-elasticsearch-fluentd-and-kibana-efk-logging-stack-on-kubernetes" rel="noopener noreferrer"&gt;EFK&lt;/a&gt; (Elastic, FluentD, Kibana) stack, use &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; for metrics, &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; for your dashboards, &lt;a href="https://www.jaegertracing.io/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt; for distributed tracing, and so on, with new players coming in like &lt;a href="https://grafana.com/oss/loki/" rel="noopener noreferrer"&gt;Loki&lt;/a&gt;, &lt;a href="https://grafana.com/oss/tempo/" rel="noopener noreferrer"&gt;Tempo&lt;/a&gt;, etc., making this space really interesting. Or, if you don't want to bother maintaining this stack yourself, you can go for SaaS providers like &lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt;, &lt;a href="https://cloud.google.com/products/operations" rel="noopener noreferrer"&gt;GCloud Operations&lt;/a&gt; (formerly Stackdriver), etc., after setting up instrumentation in your applications where applicable.&lt;/p&gt;

&lt;p&gt;Now, this becomes even more interesting if you have a service mesh in the equation: you can observe and visualize a lot of application-level metrics without changing your code, since sidecars like Linkerd's proxy or Envoy expose their own set of metrics for collection and visualization.&lt;/p&gt;

&lt;p&gt;But what about the agents which expose all this data and these metrics? Things are more interesting than ever with the &lt;a href="https://medium.com/opentracing/merging-opentracing-and-opencensus-f0fe9c7ca6f0" rel="noopener noreferrer"&gt;merger of OpenTracing and OpenCensus&lt;/a&gt; into &lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt;, the standard to look forward to for all observability in the future.&lt;/p&gt;

&lt;p&gt;If you would like a more comprehensive idea of the future (or should I say present) of observability, I would recommend this talk by Tom and Fredric: &lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/MkSdvPdS1oA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developer Experience&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This has been an area of friction for Kubernetes users, since it takes quite a lot of time to get instant feedback on changes because of all the complexity Kubernetes comes with. You have to write code, build it, push it to the registry and run it in the cluster, and if you want to debug, you are left with not-so-great ways to do it.&lt;/p&gt;

&lt;p&gt;In other words, the inner dev loop has not been great with Kubernetes until recently, when a huge number of tools emerged to address this problem. Though many tools try to improve DevEx (developer experience) with Kubernetes, none caters to every scenario out there, so you might want to choose the tool that works for you depending on whether you use local, remote or hybrid clusters for development.&lt;/p&gt;

&lt;p&gt;Some of these tools are &lt;a href="https://skaffold.dev/" rel="noopener noreferrer"&gt;Skaffold&lt;/a&gt; with &lt;a href="https://cloud.google.com/code" rel="noopener noreferrer"&gt;Cloud Code&lt;/a&gt;, &lt;a href="https://tilt.dev/" rel="noopener noreferrer"&gt;Tilt&lt;/a&gt;, &lt;a href="https://github.com/microsoft/mindaro" rel="noopener noreferrer"&gt;Bridge to Kubernetes&lt;/a&gt;, &lt;a href="https://okteto.com/" rel="noopener noreferrer"&gt;Okteto&lt;/a&gt;, &lt;a href="https://www.getambassador.io/docs/latest/topics/using/edgectl/" rel="noopener noreferrer"&gt;Service Preview&lt;/a&gt; with &lt;a href="https://www.telepresence.io/" rel="noopener noreferrer"&gt;Telepresence&lt;/a&gt;, &lt;a href="https://garden.io/" rel="noopener noreferrer"&gt;Garden&lt;/a&gt; and so on, each addressing a similar problem differently. While they have made the job a lot easier, there is still some way to go; in my view, Kubernetes should essentially fade away behind the scenes, allowing developers to focus just on the logic at hand, but I am sure we are on the right path towards it.&lt;/p&gt;

&lt;p&gt;So, with this, I hope you got an idea of what the typical "Cloud Native" stack looks like and how we manage our infrastructure "The Kubernetes Way". There is a lot more we can talk about, but I will save it for future blog posts where we will dive deeper into each part of the stack we talked about here.&lt;/p&gt;

&lt;p&gt;I would just like to conclude by saying that making your application "Cloud Native" is a continuous journey: you always have room to improve in one place or another, and this should be done incrementally, in different phases, without getting overwhelmed. Trust me, I have done it and I know how it feels, but if you start small and scale gradually, it should be an interesting journey.&lt;/p&gt;

&lt;p&gt;Any questions? Looking for help? Feel free to reach out to me &lt;a href="https://www.twitter.com/techahoy" rel="noopener noreferrer"&gt;@techahoy&lt;/a&gt;. And if this helped, do share it with your friends, hang around and follow us for more like this every week. See you all soon.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Infrastructure Engineering - The first principles</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Sat, 19 Dec 2020 11:25:08 +0000</pubDate>
      <link>https://dev.to/timecampus/infrastructure-engineering-the-first-principles-o70</link>
      <guid>https://dev.to/timecampus/infrastructure-engineering-the-first-principles-o70</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is a part of a series on Kubernetes and its ecosystem where we will dive deep into the infrastructure one piece at a time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are typically four main steps to solving any problem out there in the world.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding or identifying the problem&lt;/li&gt;
&lt;li&gt;Researching or thinking of possible ways to solve the problem&lt;/li&gt;
&lt;li&gt;Working on the actual solution&lt;/li&gt;
&lt;li&gt;Publishing or releasing your solution to your target audience for feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And then you iterate on this entire cycle continuously. This is typically the workflow we tend to follow for any problem, irrespective of the domain.&lt;/p&gt;

&lt;p&gt;Now, if you look at this in the context of infrastructure, ops and software engineering, the infrastructure itself has been taking a huge toll on the process, creating a lot of friction in the workflow, while it is supposed to be just a facilitator for the main problem at hand (unless the problem you are trying to solve is the infrastructure itself).&lt;/p&gt;

&lt;p&gt;But things are taking a positive direction today with the advent of various standards, platforms and tools, with Kubernetes, Docker, Terraform and others leading the way, charting a path to a better future.&lt;/p&gt;

&lt;p&gt;During my journey building &lt;a href="https://twitter.com/timecampus"&gt;Timecampus&lt;/a&gt;, I was researching for a standard to adopt for infrastructure, something along the lines of the &lt;a href="https://12factor.net/"&gt;Twelve-Factor App&lt;/a&gt; for SaaS apps, the &lt;a href="https://agilemanifesto.org/"&gt;Agile Manifesto&lt;/a&gt; for agile development or &lt;a href="https://principledgraphql.com/"&gt;Principled GraphQL&lt;/a&gt; for data graph architectures with GraphQL, and to my surprise I noticed that there was none I could find.&lt;/p&gt;

&lt;p&gt;So, before I start off addressing the problems Kubernetes and its ecosystem solves (or can potentially solve in the future) for infrastructure in this series, I feel that it is very important to lay down the first principles around which we should visualize infrastructure and that is what we are going to discuss in this blog post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; These principles are from my own experience and also the knowledge I have gathered going through many case studies over time. So, feel free to interpret this in your own way or dismiss any of this if it is not a fit for your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standards, Abstractions and Encapsulation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In reality, infrastructure is composed of various components, each fulfilling a specific responsibility as part of the complete stack: networking, compute, storage and, over time, a growing set of peripherals now available to use. While it is definitely healthy to have freedom of choice and competition, it is very important that this is not presented as a challenge but seen as an empowering force for users.&lt;/p&gt;

&lt;p&gt;And this can happen only if there are sensible standards and abstractions around all of these disparate components, abstracting the complexity away from end users while, at the same time, allowing them to modify things wherever needed.&lt;/p&gt;

&lt;p&gt;Infrastructure should be seen as nothing but a "&lt;strong&gt;Black Box&lt;/strong&gt;" which is decomposable into multiple different black boxes as needed with each black box fulfilling a separate responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automate Everything&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As discussed before, infrastructure should act as an enabler and not a bottleneck, with your valuable time better spent working on the actual problem rather than the infrastructure itself. Keeping this in mind, every operation on the infrastructure should be automated to the best extent possible.&lt;/p&gt;

&lt;p&gt;And this can be anything which is part of the process, including healing, recovery, failover, deployments, migrations, upgrades, log shipping and more. If you find yourself doing something repeatedly, consider automating it, since that can save countless hours in the future and also prevent the human errors that come with doing things manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While it is important to abstract and encapsulate the various infrastructure components, they should also be extensible by anyone as needed, for example via APIs, allowing users to build on top of these components for their specific use case.&lt;/p&gt;

&lt;p&gt;This is important because not every use case in the world can be addressed natively, and adding them all to the core would most often be nonsensical. Keeping this in mind, a plugin-driven approach is often empowering for the entire ecosystem in general.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Environmental Consistency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While infrastructure can vary between environments (development, staging or production) and across deployments (on-premise, private and public clouds, or even different cloud or on-premise providers), the user experience should be consistent across all of these environments, with the differences between them abstracted away, typically as adapters.&lt;/p&gt;

&lt;p&gt;So, while users interact with the same APIs in the same way consistently across environments, the environment-specific handling of those APIs should live in adapters or sandboxes, acting as a compatibility layer on top of the possibly different infrastructure APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Packaging and Release&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Environmental consistency also requires a consistent way to package and release your application, or the infrastructure itself. A consistent packaging and release mechanism creates no surprises when working across different environments and also helps reproduce any inconsistencies or failures which may occur in a different environment. It also helps to properly version-control releases over time and allows users to gradually update or upgrade portions of the infrastructure as they see fit, after appropriate review of the changes, without forcing anything onto them.&lt;/p&gt;

&lt;p&gt;This mandates that every component of the infrastructure maintain its own versioned packaging and release mechanism, allowing a consistent and clear upgrade path for every component.&lt;/p&gt;
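&lt;p&gt;For example, with Helm the packaging itself is versioned separately from the application it wraps; the name and versions below are purely illustrative:&lt;/p&gt;

```yaml
# Hypothetical Chart.yaml: bumping `version` marks a packaging change,
# while `appVersion` tracks the application being packaged.
apiVersion: v2
name: example-service
version: 1.4.2
appVersion: "2.0.1"
```

&lt;p&gt;Keeping the two versions separate is what lets users review and adopt packaging changes independently of application releases.&lt;/p&gt;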

&lt;p&gt;&lt;strong&gt;Modularity, Single Responsibility &amp;amp; Incremental Scalability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every piece of the infrastructure, as discussed before, should be a black box in itself and thereby modular, with its own set of APIs, each fulfilling a single responsibility, orchestrated together to fulfill the complete use case. This not only reduces the possibility of errors creeping in, with each module handling its own release cycle, but also allows for incremental adoption and scalability: projects from small to large can use just the modules they really need to satisfy their use case.&lt;/p&gt;

&lt;p&gt;This also allows for scalability at the module level depending on your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Administration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the infrastructure works underneath as loosely coupled modules, it should be easy for end users to administer wherever needed. It is ideally visualized as a honeycomb, with each cell fulfilling its own responsibility.&lt;/p&gt;

&lt;p&gt;Users must be able to easily monitor and administer any of these components as needed, optionally abstracting away the complexity of what happens underneath, offering a way to respond quickly when issues crop up in any part of the infrastructure. This also enables use cases like cost optimization, where infrastructure can be utilized effectively wherever possible, and security, where the communication paths between all components are visible to the administrator, enabling them to secure everything properly as needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Declarative and Self-Documenting Infrastructure as Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we discussed above, infrastructure is made of a lot of components, and the one thing that is constant over time is "&lt;strong&gt;change&lt;/strong&gt;". We should architect for change, and this is where Infrastructure as Code really helps: it lets us document and orchestrate all the different states of the infrastructure over time declaratively, version-control them, and do it all in such a way that there is also a path to roll back to a previous state when needed.&lt;/p&gt;

&lt;p&gt;This also paves the way for collaboration, reviews, automation and testing, where the changes between two different states can be well tested in different environments and then consistently rolled out to multiple environments as needed.&lt;/p&gt;
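&lt;p&gt;With Terraform (mentioned earlier in this series), for instance, the desired state lives in version-controlled code; the resource name, location and node count below are placeholders:&lt;/p&gt;

```hcl
# Declarative description of a hypothetical GKE cluster: the file is the
# documentation of the desired state, and its git history is the changelog.
resource "google_container_cluster" "primary" {
  name               = "example-cluster"
  location           = "us-central1"
  initial_node_count = 3
}
```

&lt;p&gt;Changes to a file like this can be reviewed as diffs, previewed with &lt;code&gt;terraform plan&lt;/code&gt; and rolled back by reverting the commit.&lt;/p&gt;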

&lt;p&gt;&lt;strong&gt;Sensible Management of State&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The problem is not the state itself, but managing it properly. Avoiding state is often difficult, especially given the dynamic nature of applications today.&lt;/p&gt;

&lt;p&gt;Maintaining state improperly can cause a lot of issues in scaling your applications and the infrastructure that goes with them. So, rather than spreading state everywhere, it is often sensible to manage it properly and in isolation, so that the rest of the application can work and scale independently of the stateful services, while the stateful services follow their own path to scalability, decoupling the two worlds from each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faster Inner Dev Loop and Agility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While you may maintain a complex infrastructure and rollout process for your use case, the same complexity should not be exposed during the development phase. Rather, the development experience should be such that the infrastructure completely fades behind the scenes and developers just focus on getting quick feedback on all the changes they make.&lt;/p&gt;

&lt;p&gt;This can involve a lot of techniques: syncing files instead of rebuilding images, hybrid development where developers host locally just the portion of the infrastructure they work on and leave the rest in the target environment, custom network proxying, better IDE integration and more, all done to speed up the inner dev loop and the agility of development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No single point of failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For infrastructure to be reliable, there should typically be no single point of failure; such systems are called highly available. This calls for decentralizing responsibilities and having things like redundancies and failovers where needed. But it is also important to consider that 99.99% uptime might not be the requirement for every use case, and different use cases call for different levels of redundancy and failover, making this decision completely dependent on the use case.&lt;/p&gt;

&lt;p&gt;Having said that, it is always better to design simple systems at the start and scale them incrementally over time as the need arises. Building highly scalable systems is a journey and not a destination in itself, since more decentralization creates more complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support for Seamless Collaboration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Products are often built with the effort of multiple members of a team and the same goes for infrastructure and the applications running on it as well. &lt;/p&gt;

&lt;p&gt;Keeping this in mind, it is important to support seamless collaboration while working on the infrastructure: facilitating parallel changes from multiple people, allowing any part of the infrastructure to be researched and experimented with in an isolated context or sandbox without affecting others, and allowing portions of the infrastructure to be shared or packaged as needed. All of this should be done with security in mind, allowing only authorized personnel to make changes, but without creating friction amongst users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero Trust Security&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As experts rightly say, security shouldn't be an afterthought but should be a core principle from the start when building or working with infrastructure. It is often naïve to say "&lt;strong&gt;This cannot be compromised&lt;/strong&gt;", since we all know what happens. Rather, the thinking should be along the lines of "&lt;strong&gt;If this gets compromised, this is what will happen and this is how I will respond&lt;/strong&gt;", and this is where Zero Trust Security really plays a major role: you never trust anything or anyone at any time (even if in real life you should).&lt;/p&gt;

&lt;p&gt;This gives a first look at all the ways your infrastructure can be compromised and helps you take preventive measures and be ready before the bad stuff actually happens. This can involve anything from reviewing access policies and network paths, encrypting data and other PII, separating data into multiple tenants, scanning images and reviewing the underlying infrastructure for security concerns, to separating the attack boundaries, or anything similar which can either prevent or reduce the impact of an attack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure should be dumb&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While this might sound crazy, the truth is that the dumber the infrastructure, the better it is for stability and maintainability. As we mentioned above, every module should have a single responsibility, and there should always be a clear cause for any change in the infrastructure.&lt;/p&gt;

&lt;p&gt;For instance, if your infrastructure has to auto-scale, it is often better to do so without introducing too much complexity into the infrastructure or its controllers. Rather, dumb it down as much as possible and use sensible criteria, either from experience or from historical data, to make scaling decisions, manually or automatically, depending on how often such changes are needed. Asking your infrastructure to do too much for you will just raise the complexity. And if you are indeed looking to create a smart infrastructure system, try to spread the complexity rather than keeping it locked in a single component.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discoverability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we discussed, there are various components to an infrastructure. Considering this, it is very important to have discoverability in mind when designing these systems since they might often need to work together to complete a specific task which can happen only if they are discoverable as and when needed.&lt;/p&gt;

&lt;p&gt;While the mechanism of discoverability itself boils down to your use case, one thing to keep in mind is that it is better to distribute discoverability, both because we don't want a single point of failure and because it should remain easy to understand the various calls and the network traffic without introducing too much complexity, even as the number of components and services increases over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eventual consistency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Distributed systems and infrastructure are always eventually consistent, considering factors like network latency, bandwidth, throughput, I/O speeds and so on. So, rather than avoiding this fact, designing a system to support eventual consistency is always better, since it will help you better handle failures or data lags whenever they occur.&lt;/p&gt;

&lt;p&gt;While there are mechanisms like distributed transactions in place to handle this problem, even such solutions are nothing but eventual consistency with abstractions on top. Keeping this in mind, infrastructure and applications are better designed for such scenarios rather than expecting everything to be consistent all the time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Testability becomes really important when you want a high degree of stability and automation, given the constant influx of change you might have in the future. Every part of the infrastructure should be testable both in isolation and as a group, to bring about a greater degree of confidence in the underlying infrastructure.&lt;/p&gt;

&lt;p&gt;This can be anything from testing the various integration points between infrastructure components to testing the logic within them, giving a complete picture of what went, or can go, wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sensible defaults&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While we discuss all of these principles, effort must also be made to provide sensible defaults wherever possible, allowing users to incrementally adopt, change or scale as needed. Sensible defaults include making every piece of infrastructure secure by default and enabling basic functionality like monitoring, logging, backups, failover, etc. without manual configuration.&lt;/p&gt;

&lt;p&gt;This can prevent a lot of issues which happen simply due to a lack of knowledge while working on a part of the infrastructure, make things simpler for newcomers, and also help users start any project as quickly as possible without going through too much configuration before working on something.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seamless upgrades, updates, maintenance and failovers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While cloud providers do offer SLAs for paid deployments, this should not be a feature but rather the default. Upgrades, updates, failovers and maintenance should happen seamlessly without affecting the workloads running on the infrastructure. And even in case of failures, there should be a way to keep serving sensible traffic to customers as and when needed, avoiding possible downtime.&lt;/p&gt;

&lt;p&gt;All this can happen only if a possible failure is detected as early as possible and effort is made to switch over to healthy traffic and deployments wherever available. To the same end, proper health checks should be in place to identify and handle failures in any part of the infrastructure, and even to handle upgrades seamlessly as they occur.&lt;/p&gt;
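&lt;p&gt;In Kubernetes, for instance, such health checks are declared as probes on the container spec; the endpoint paths and port below are hypothetical:&lt;/p&gt;

```yaml
# Fragment of a hypothetical container spec: the readiness probe gates
# traffic to the pod, while the liveness probe restarts an unhealthy one.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
```

&lt;p&gt;During a rolling upgrade, the same readiness signal is what lets Kubernetes shift traffic to new pods only once they can actually serve it.&lt;/p&gt;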

&lt;p&gt;&lt;strong&gt;Fail Quickly and Respond Sensibly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is not practically possible to have 100% uptime for every portion of your infrastructure. Rather, components should fail quickly whenever there are issues and respond sensibly whenever possible. It is often better to let failures happen than to suppress them, which might create a catastrophe at a future point in time.&lt;/p&gt;

&lt;p&gt;And these, in my opinion, are the most important principles to keep in mind, especially when working with scalable infrastructure.&lt;/p&gt;

&lt;p&gt;Do let us know if we have missed out on something important here since I would like to treat this as a living document updating it as the need arises based on feedback from the community.&lt;/p&gt;

&lt;p&gt;Hope this has shed some light on the various things to keep in mind while working with infrastructure. As complex as these may sound (and they are indeed complex), there are a lot of tools and a huge ecosystem in the community today which can help us do all of this (at least for the most part).&lt;/p&gt;

&lt;p&gt;As I was writing this blog post, I got a small idea: making the first principles across all domains available for everyone in one place. &lt;br&gt;
With that aim in mind, I would like to announce the launch of a new site to do just that: &lt;strong&gt;&lt;a href="https://thefirstprinciples.dev"&gt;https://thefirstprinciples.dev&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From the next blog in this series, we will start exploring how the various tools in the ecosystem fit right in to address these various principles and solve the difficult challenges lying ahead in front of us.&lt;/p&gt;

&lt;p&gt;Any questions? Looking for help? Feel free to reach out to me &lt;a href="https://twitter.com/techahoy"&gt;@techahoy&lt;/a&gt;. And if this helped, do share it with your friends, hang around and follow us for more like this every week. See you all soon.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>docker</category>
      <category>cloud</category>
    </item>
    <item>
      <title>GraphQL - The Workflow</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Fri, 11 Dec 2020 03:33:44 +0000</pubDate>
      <link>https://dev.to/timecampus/graphql-the-workflow-2hbn</link>
      <guid>https://dev.to/timecampus/graphql-the-workflow-2hbn</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is a part of a series on GraphQL where we will dive deep into GraphQL and its ecosystem one piece at a time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What an interesting journey it has been so far! We explored some of the amazing libraries, tools and frameworks which really empower GraphQL to be what it is today, with almost everything open source and created with love by the community. But I do understand that this can be overwhelming for some of you who are just starting your journey with GraphQL, and you may have some trouble putting it all together to work for you.&lt;/p&gt;

&lt;p&gt;To address this, in this blog we will talk about the workflow with GraphQL and the tools we have looked at so far, and the process of taking it all from development to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; While these steps have been ordered serially, that is just to give you a sense of the workflow. Some of them can also be deferred for later or done in parallel when working with multiple teams. So, with that in mind, let's start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Evaluation and Research&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we have discussed before in &lt;a href="https://dev.to/timecampus/graphql-usecase-and-architecture-422p"&gt;this blog post&lt;/a&gt;, GraphQL may not be the solution to every problem. And even if it does satisfy your use case well, this is the first thing you need to do: understand why you want to use GraphQL, how you would like to use it, and the ecosystem of tools you need to solve your problem. This can be done only if you introspect your use case and really get back to basics, answering a few obvious questions about GraphQL and your use case.&lt;/p&gt;

&lt;p&gt;Also, remember that not every organization operates at the scale of Google, Microsoft or Facebook, and what works for them need not work for you. So, while you should definitely be informed about how other people do things, remember to focus on what works for you and what you really need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Get a boilerplate stack ready&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GraphQL can get overwhelming if you are re-inventing the wheel every time: putting together your schema, resolvers and server, and the various tools you would typically use with it like linting, codegen and so on. Doing this every time you work on a new service is not a good use of your time.&lt;/p&gt;

&lt;p&gt;The best way to avoid this is to put together a boilerplate with everything you would typically want to use, which can then become the starting point for all the services you develop in the future. This would also involve things like setting up your GraphQL Gateway (in case you are using something like Federation or Stitching), since the gateway becomes the single point of contact for all requests from your clients.&lt;/p&gt;

&lt;p&gt;Now, if you are using something like Typescript/Javascript, the tooling you might want to start off with at this stage would typically include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a GraphQL server like Express GraphQL, Apollo, Helix, Mercurius or anything else which might work for you&lt;/li&gt;
&lt;li&gt;something like GraphQL Modules, if you are looking to split your resolvers into multiple modules&lt;/li&gt;
&lt;li&gt;a mechanism to merge together multiple GraphQL schemas as the need arises, with something like GraphQL Tools&lt;/li&gt;
&lt;li&gt;GraphQL Config, to help all the tools work in tandem with your schema&lt;/li&gt;
&lt;li&gt;Codegen and its extensions/presets, so that you can re-use the types generated from your schema&lt;/li&gt;
&lt;li&gt;ESLint, set up with your own validation rules&lt;/li&gt;
&lt;li&gt;something like GraphQL Inspector, so that you can do various operations with your schema like validation, mocking and everything else you would typically want as part of your tooling&lt;/li&gt;
&lt;li&gt;your editor/IDE, set up with appropriate extensions and tools to help you with the development process&lt;/li&gt;
&lt;/ul&gt;
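&lt;p&gt;To make one of these pieces concrete, a hypothetical &lt;code&gt;codegen.yml&lt;/code&gt; for GraphQL Code Generator might look like this (the file paths are placeholders):&lt;/p&gt;

```yaml
# Generate TypeScript types from the schema so resolvers and clients
# share a single source of truth; the paths here are illustrative.
schema: src/schema.graphql
generates:
  src/generated/types.ts:
    plugins:
      - typescript
      - typescript-resolvers
```

&lt;p&gt;Baking a config like this into the boilerplate means every new service starts with typed resolvers for free.&lt;/p&gt;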

&lt;p&gt;While you can definitely iterate on this as you go along, having the bare bones ready when you start can take a lot of effort away and save a lot of time in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Putting together the data graph and documentation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything you do with GraphQL for your use case revolves mainly around your schema and its types, since they become the base of everything you develop on top. Getting your data graph ready is typically the next important step, and the way you do it does not matter: you can go either the SDL-first or the code-first route, depending on what works well for you.&lt;/p&gt;

&lt;p&gt;You might also want to write appropriate documentation in parallel as you work on your schema, especially since GraphQL is self-documenting, and it is always better to do it while you have the context of what you are doing rather than as an afterthought.&lt;/p&gt;
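&lt;p&gt;For example, with the SDL-first route, the schema and its documentation grow together; the types below are purely illustrative:&lt;/p&gt;

```graphql
"""
A hypothetical type; descriptions like this one double as the
self-documentation exposed through introspection.
"""
type User {
  id: ID!
  name: String!
}

type Query {
  "Fetch a single user by id (may return null if not found)."
  user(id: ID!): User
  users: [User!]!
}
```

&lt;p&gt;Tools like GraphiQL and codegen pick these descriptions up automatically, so the documentation stays next to the thing it describes.&lt;/p&gt;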

&lt;p&gt;Now, if you are working on a microservices architecture and you are looking to split the data graph into multiple parts to be composed or stitched from multiple services, using something like Federation or Stitching, you would also need to understand the clear boundaries of the microservices and how all of them relate to each other through the data graph.&lt;/p&gt;

&lt;p&gt;These boundaries will also decide which service hosts the resolvers and logic that go along with resolving the various fields in the schema and performing the business logic, as needed, in isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Deploying it all as per the need&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that you have your boilerplate and data graph ready, the next step before working on your resolvers or any business logic is typically to actually deploy it all, wherever and however you want to: public cloud, private cloud or on-premise; as containers, VMs or bare metal.&lt;/p&gt;

&lt;p&gt;Doing this will help you proceed as per your architecture, be it single- or multi-tenant, and help you resolve the major questions you might have regarding the end-to-end flow of data, considering all the compliance policies and laws you might want to cater to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Mocking and Testing Client Consumption&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that you have everything deployed and ready, the next step you would typically do is testing it all together. Now, you might wonder how this will even work without any resolvers, or a backend to serve the data with.&lt;/p&gt;

&lt;p&gt;While you could definitely spend your time writing the resolvers and business logic or connecting your backend, you first might want to test the end-to-end data flow so that you get validation on how clients would typically interact with your GraphQL API. To do this, you can either mock your schema or initially hard-code the data in your resolvers, and then serve the schema and test it all end-to-end.&lt;/p&gt;

&lt;p&gt;This will establish confidence in your development workflow, give you a clear idea of the data path and what your GraphQL operations (mutations and queries) will look like, provide insight into how you can consume the data, and also present you with opportunities for things like end-to-end type checking and code generation with your clients.&lt;/p&gt;
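&lt;p&gt;A minimal sketch of what hard-coding resolver data can look like; the field names and the &lt;code&gt;mockDb&lt;/code&gt; shape are hypothetical, not from any real schema:&lt;/p&gt;

```javascript
// Resolvers are plain functions: given arguments, they return data.
// Hard-coding the data lets clients exercise the full query path
// before any real backend exists.
const mockDb = {
  users: [
    { id: '1', name: 'Ada' },
    { id: '2', name: 'Grace' },
  ],
};

const resolvers = {
  Query: {
    // Same signatures you would keep after swapping in a real data source.
    user: (_parent, { id }) => mockDb.users.find((u) => u.id === id) || null,
    users: () => mockDb.users,
  },
};
```

&lt;p&gt;Because the resolver signatures stay the same, replacing the mock with a real database call later is invisible to clients.&lt;/p&gt;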

&lt;p&gt;&lt;strong&gt;Step 6: Getting the resolvers and backend setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As you might already know, with GraphQL your clients don't have to worry about the data source, your backend logic and the various complexities that go with them, since these are all abstracted away, and this helps you scale the backend and frontend independently of each other.&lt;/p&gt;

&lt;p&gt;To do this, try treating your resolvers as just entities which perform an operation and respond with data given a set of inputs (similar to what you would typically do with a REST API). So, set up the backends/data sources from which you want to serve the data (be it a database like &lt;a href="https://www.postgresql.org/"&gt;Postgres&lt;/a&gt; or &lt;a href="https://www.mongodb.com/"&gt;Mongo&lt;/a&gt;, with or without an ORM like &lt;a href="http://prisma.io/"&gt;Prisma&lt;/a&gt;, &lt;a href="http://knexjs.org/"&gt;Knex&lt;/a&gt; or &lt;a href="https://sequelize.org/"&gt;Sequelize&lt;/a&gt;, or even an underlying resource like a REST API, maybe with something like &lt;a href="https://graphql-mesh.com/"&gt;GraphQL Mesh&lt;/a&gt;, or a graph database like &lt;a href="https://dgraph.io/"&gt;Dgraph&lt;/a&gt;), and also your resolvers to process the data as you see fit, adding your business logic on top and returning the fields as needed. This is the point where you replace the mocked data with data from the backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Optimizing the data path&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that you have connected your data sources and added your business logic with all the resolvers you need, the next step is typically to optimize the data path: make sure you are not making repeated calls to the database (which increases load and bandwidth usage), reduce roundtrips and processing as much as possible, and provide faster response times as clients ask for data.&lt;/p&gt;

&lt;p&gt;This is where you set up batching and solve N+1 problems with something like a &lt;a href="https://github.com/graphql/dataloader"&gt;dataloader&lt;/a&gt;; set up caching with something like &lt;a href="https://redis.io/"&gt;Redis&lt;/a&gt; or even an &lt;a href="https://www.npmjs.com/package/lru-cache"&gt;LRU cache&lt;/a&gt; to act as a proxy for frequently accessed data wherever possible; reduce network chatter with something like persisted queries; optimize your resolvers by retrieving as much data as possible from the parent resolvers; set up pagination to limit the results returned; enforce query complexity limits to control the level of nesting and computation performed; and rate-limit at the gateway to avoid things like &lt;a href="https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/"&gt;DDoS&lt;/a&gt; attacks.&lt;/p&gt;

&lt;p&gt;This is really important because, while GraphQL provides your clients with a high degree of flexibility, it also comes with its own set of risks if not used right. So keep even the worst-case scenarios in mind and design for failures. Remember that sometimes it is better for your application to fail and crash than to do the wrong thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8: Controlling and Securing the data path&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GraphQL gives its clients access to whatever data they request, and while this sounds empowering (and it is), it is not without its own set of risks. You have to make sure that only authorized clients have access to the data, that they receive only the data they are allowed to have, and only when they need it, with a proper context and purpose to the operation.&lt;/p&gt;

&lt;p&gt;To do this, authenticate all clients whenever and however needed; set up authorization rules for all fields, whether via directives, resolvers or any other mechanism that works well for you; have an encryption/decryption mechanism for confidential data like PII; and retain the ability to blacklist specific clients whenever needed. Control and secure the data as much as possible from your end, treating security as a first-class citizen rather than an afterthought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 9: Testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Testing plays a major role, especially when building scalable systems which have to remain reliable under a huge stream of changes over time. This is no exception when you work with GraphQL. You can set up automated tests, integration tests and so on as you normally would to improve the confidence people have in the system. And there are a lot of libraries which facilitate this, like &lt;a href="https://mochajs.org/"&gt;Mocha&lt;/a&gt;, &lt;a href="https://jestjs.io/"&gt;Jest&lt;/a&gt; and &lt;a href="https://github.com/avajs/ava"&gt;AVA&lt;/a&gt;, taking a lot of the burden away from you.&lt;/p&gt;

&lt;p&gt;You can test your resolvers, your GraphQL endpoint, your schema and so on. Testing not only improves the reliability of your code; it also acts as a secondary source of documentation for people looking to understand what every function does and how to use it as part of their workflow. So, doing this as you go along helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 10: Automating or Scripting the repeatable parts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you work on GraphQL, or anything else for that matter, there is often a set of operations which you do again and again, and over time the cost of doing them manually grows quickly.&lt;/p&gt;

&lt;p&gt;For example, you might push your schema to a registry for every change if you use something like federation, validate/lint your schema, run code generation as you change the SDL, or anything else specific to your use case. This is where automation and scripting play a major role, and I can definitely say that I have saved countless hours of my valuable time just by scripting out the things I do repeatedly as part of my workflow.&lt;/p&gt;

&lt;p&gt;Automation, especially with CI/CD, becomes even more impactful when you are working in teams. There are a lot of interesting things you can do in your CI/CD pipeline, like linting and validating your schema, listing breaking changes, pushing the schema to the registry, running automated tests, and sending notifications to the relevant people in your team as needed. This saves a lot of time and also provides a high degree of reliability and confidence in what you ship to production.&lt;/p&gt;

&lt;p&gt;This summarizes the most important steps to perform as part of your workflow with GraphQL. Some things have been deliberately left out of this list, like setting up your infrastructure for file uploads or enabling real-time data exchange with subscriptions/live queries, since they depend on your use case at hand; if you are interested in those, have a look at our previous blog posts where we discuss various tools and libraries which can help you with them.&lt;/p&gt;

&lt;p&gt;While all of this may seem overwhelming, you need not do it all when you start; rather, adopt it incrementally as you go along.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But, I am not using JavaScript/Typescript&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While this series addresses most questions with examples in Javascript and Typescript, note that they are not the only languages GraphQL is compatible with, since GraphQL is language independent, and you can always draw parallels in other languages. If you find yourself working in another language, &lt;a href="https://graphql.org/code/"&gt;&lt;strong&gt;this&lt;/strong&gt;&lt;/a&gt; might help, and if you are looking for tutorials, there is a good catalog &lt;a href="https://www.howtographql.com/"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;. As we discussed before, the ecosystem is huge and growing, with more resources like these cropping up every day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concluding...&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As all good things come to an end, this blog is the last in this GraphQL series. But if you are looking for something specific which we have not addressed, do let us know and maybe we can do a follow-up blog post or even add it to this series if it makes sense. The reason we conclude here is that we intend to keep this series a guide rather than a tutorial series, since there is a lot of information already out there regarding the various tools, libraries and frameworks we talked about.&lt;/p&gt;

&lt;p&gt;But rest assured, we will definitely have a lot of blogs like these in the future as we work with GraphQL more and more, and we also intend to provide you with a case study on how we do all of this at &lt;a href="https://twitter.com/timecampus"&gt;Timecampus&lt;/a&gt; sometime down the line. Do stick around for that. In the meantime, there are a lot of community resources like &lt;a href="https://graphql.org/community/"&gt;these&lt;/a&gt;, with blog posts, videos and books, which are really worth checking out.&lt;/p&gt;

&lt;p&gt;Also, I intend to keep the blog posts in this series as living documents rather than one-off blog posts. Hence, you might find us updating the information shared if needed over time.&lt;/p&gt;

&lt;p&gt;If you are working your way through GraphQL and if this series really did help you in your path, we would love to know your story. GraphQL is where it is today because of people like you, the community, and I am very positive about its present and future especially in a data driven world and the journey towards bringing about a &lt;a href="https://en.wikipedia.org/wiki/Semantic_Web"&gt;semantic web&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have any questions or are looking for help, feel free to reach out to me &lt;a href="https://twitter.com/techahoy"&gt;@techahoy&lt;/a&gt; anytime.&lt;/p&gt;

&lt;p&gt;And if this helped, do share it with your friends, and do hang around and follow us for more like this every week. See you all soon.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GraphQL - The Stack #3</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Sun, 06 Dec 2020 02:19:07 +0000</pubDate>
      <link>https://dev.to/timecampus/the-stack-3-2b7m</link>
      <guid>https://dev.to/timecampus/the-stack-3-2b7m</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is a part of a series on GraphQL where we will dive deep into GraphQL and its ecosystem one piece at a time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this series, we have looked at some of the interesting parts of the GraphQL stack so far, covering a range of tools, libraries and frameworks from the community. Let us continue the journey in this blog, looking at more such tools and services which have had a great impact on the GraphQL ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/graphql/graphiql"&gt;GraphiQL&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The evolution of GraphQL clients has been really amazing, and I would say this is one of the great things about GraphQL, given its powerful introspection capabilities, its self-documenting nature and the ability to extend everything with extensions.&lt;/p&gt;

&lt;p&gt;It all started with GraphiQL demonstrating all of this back in the day, but then came Playground (&lt;a href="https://foundation.graphql.org/news/2020/04/03/web-based-graphql-ides-for-the-win-how-why-playground-graphiql-are-joining-forces/"&gt;which recently merged with the GraphiQL team&lt;/a&gt; to make things even more interesting), &lt;a href="https://github.com/imolorhe/altair"&gt;Altair&lt;/a&gt; and even desktop/web/editor based clients like &lt;a href="https://insomnia.rest/"&gt;Insomnia&lt;/a&gt;, &lt;a href="https://www.postman.com/graphql/"&gt;Postman&lt;/a&gt;, &lt;a href="https://hoppscotch.io/"&gt;Hoppscotch&lt;/a&gt; and &lt;a href="https://marketplace.visualstudio.com/items?itemName=humao.rest-client"&gt;VSCode Rest Client&lt;/a&gt;, and the list goes on, all proving that the developer experience with GraphQL can be made significantly better with just some sugar on top.&lt;/p&gt;

&lt;p&gt;The future of GraphiQL looks really promising because of the &lt;a href="https://github.com/graphql/graphiql/issues/1445"&gt;upcoming support for Monaco mode&lt;/a&gt;, &lt;a href="https://github.com/graphql/graphiql/issues/983"&gt;support for plugins&lt;/a&gt; and a lot of amazing features from Playground becoming part of GraphiQL as part of the transition, according to the blog linked above.&lt;/p&gt;

&lt;p&gt;Also, embedding a GraphiQL editor is as simple as importing the HTML and related assets as specified in their &lt;a href="https://github.com/graphql/graphiql/blob/main/packages/graphiql/README.md"&gt;README&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And while the user experience is made as simple as possible, there are a huge number of components which make it all happen behind the scenes, as mentioned in the README; you can have a look at all of them in the monorepo &lt;a href="https://github.com/graphql/graphiql/tree/main/packages"&gt;here&lt;/a&gt; and &lt;a href="https://github.com/graphql/graphql-js/tree/master/src"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--455BmQfm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8gvxby0ngpau3s8e5jll.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--455BmQfm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8gvxby0ngpau3s8e5jll.jpg" alt="GraphiQL"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: GraphiQL&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/graphql/graphiql/tree/main/packages/codemirror-graphql"&gt;Codemirror&lt;/a&gt; used to provide the editor support for GraphiQL, Playground, Insomnia and other editors in the ecosystem in &lt;code&gt;1.x&lt;/code&gt;. It is now succeeded by the &lt;a href="https://github.com/graphql/graphiql/tree/main/packages/graphql-language-service"&gt;language service&lt;/a&gt;, which provides a web/desktop IDE experience in editors like VSCode, and the &lt;a href="https://github.com/graphql/graphiql/tree/main/packages/graphql-language-service-parser"&gt;Language Parser&lt;/a&gt;, which parses the GraphQL SDL and operations you write and converts them to a GraphQL AST. (If you are curious about how the AST looks, go to &lt;a href="https://astexplorer.net/"&gt;ASTExplorer&lt;/a&gt;, select GraphQL, enter your operation and have a look at the resulting AST, which is what the final representation looks like.) Together these have become a platform not just for GraphiQL but for the entire editor ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/OneGraph/graphiql-explorer"&gt;GraphiQL Explorer&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Starting with GraphQL or GraphiQL can have a learning curve for beginners, since it takes a different approach to dealing with data. And even after settling in with GraphQL, some people feel life was better when they were using something as simple as REST or gRPC.&lt;/p&gt;

&lt;p&gt;This is where tools like GraphiQL Explorer play a major role: queries and mutations are constructed automatically just by checking the fields you need in the schema.&lt;/p&gt;

&lt;p&gt;This workflow feels intuitive since it is as simple as ticking off the fields you need in your client. You can read about how OneGraph solves this problem &lt;a href="https://www.onegraph.com/blog/post/2/how-onegraph-onboards-users-who-are-new-to-graphql"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is just a React component which you include with your GraphiQL instance and the rest is history.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rmXfmFeG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zkgezl8utsvmxmlk5hca.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rmXfmFeG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zkgezl8utsvmxmlk5hca.gif" alt="GraphiQL Explorer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/APIs-guru/graphql-voyager"&gt;GraphQL Voyager&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next beautiful tool I would like to talk about here is GraphQL Voyager. In fact, this is the first tool I used when I was new to GraphQL a few years back, and it blew me away with the potential of what GraphQL can do.&lt;/p&gt;

&lt;p&gt;The reason this is great is that it leverages the complete power of GraphQL introspection. You get to see all the entities and how they are related, search through the schema and browse the docs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jGnb-h0w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/i3q0giapn7hk9hrbiy8d.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jGnb-h0w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/i3q0giapn7hk9hrbiy8d.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: GraphQL Voyager&lt;/p&gt;

&lt;p&gt;And today, &lt;a href="https://github.com/graphql-editor/graphql-editor"&gt;GraphQL Editor&lt;/a&gt; takes this one step further, allowing you to view, edit and browse all the entities and their hierarchy, making it a great tool for anyone who wants to quickly work through the schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jaydenseric/graphql-upload"&gt;GraphQL Upload&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One important thing the GraphQL spec does not cover is a way to transmit files over the wire when using GraphQL. This is where GraphQL Upload comes in. While not an official spec from the GraphQL Foundation, Jayden has done a great job putting together a &lt;a href="https://github.com/jaydenseric/graphql-multipart-request-spec"&gt;multipart spec&lt;/a&gt; to address exactly this problem.&lt;/p&gt;

&lt;p&gt;GraphQL Upload is a library which provides a great implementation of this spec and works with various frameworks. One thing to remember: while GraphQL Upload definitely does the job and works well at significant scale, you might want to stick to plain HTTP for heavier production workloads because of the &lt;a href="https://www.apollographql.com/blog/apollo-server-file-upload-best-practices-1e7f24cdc050/"&gt;reasons outlined in this blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And if you are using something like a GraphQL gateway with either federation or stitching, make sure that you don't overload the gateway by transmitting files through it, creating bottlenecks which can affect the rest of your requests. Try striking a balance, since GraphQL need not be the solution for every problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/enisdenjo/graphql-ws"&gt;GraphQL WS&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Subscriptions are a powerful part of GraphQL, allowing you to track the operations happening on your data in near-real time, but they mandate the use of a protocol like WebSockets or something like Server-Sent Events (SSE).&lt;/p&gt;

&lt;p&gt;While &lt;a href="https://github.com/apollographql/subscriptions-transport-ws"&gt;subscriptions-transport-ws&lt;/a&gt; from Apollo initially started off this journey, &lt;a href="https://the-guild.dev/blog/graphql-over-websockets"&gt;it is not actively maintained&lt;/a&gt;, and GraphQL WS by Denis is definitely a great replacement, with no external dependencies and the ability to work across many frameworks.&lt;/p&gt;

&lt;p&gt;But do remember that WebSockets, while definitely here to stay, might lose their lead in the future, especially with the introduction of HTTP/2 and HTTP/3, &lt;a href="https://wundergraph.com/blog/deprecate_graphql_subscriptions_over_websockets"&gt;as mentioned here&lt;/a&gt;. This wouldn't affect GraphQL in any way, since it is transport independent.&lt;/p&gt;

&lt;p&gt;Also note that subscriptions are not the only way to do real-time communication in GraphQL. There are also Live Queries, with &lt;a href="https://github.com/n1ru4l/graphql-live-query"&gt;great libraries like this&lt;/a&gt; from Laurin which you can use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.apollographql.com/docs/federation/"&gt;Apollo Federation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While Schema Stitching was initially advocated by Apollo with the introduction of many helper functions in GraphQL Tools, their &lt;a href="https://www.apollographql.com/docs/federation/migrating-from-stitching/"&gt;direction changed soon&lt;/a&gt; after hearing a lot of feedback from their customers, and they introduced Apollo Federation. You can read their reasoning in &lt;a href="https://www.apollographql.com/blog/apollo-federation-f260cf525d21/"&gt;this blog&lt;/a&gt;, but this does not mean that &lt;a href="https://github.com/ardatan/graphql-tools/issues/1286"&gt;stitching has lost its relevance&lt;/a&gt;, especially with the introduction of Type Merging.&lt;/p&gt;

&lt;p&gt;Apollo Federation does a great job, especially when you use it with the rest of the Apollo ecosystem like &lt;a href="https://www.apollographql.com/docs/studio/"&gt;Apollo Studio&lt;/a&gt;. The Apollo stack offers &lt;a href="https://www.apollographql.com/pricing"&gt;a lot of features&lt;/a&gt; relevant to working with a data graph in an organization: a registry where you can upload parts of the combined schema from all services, version control of schema changes with validation of breaking changes, metrics on all the clients consuming the schema, tracing of all operations, multiple variants to manage multiple environments, alerting across multiple channels, and a CLI to work with all of these.&lt;/p&gt;

&lt;p&gt;And this can definitely help teams who want to maintain their own part of the schema.&lt;/p&gt;

&lt;p&gt;Federation comes with its &lt;a href="https://www.apollographql.com/docs/federation/federation-spec/#federation-schema-specification"&gt;own specification and directives&lt;/a&gt; which let you define the relations between multiple GraphQL entities so that the &lt;a href="https://github.com/apollographql/federation/tree/main/gateway-js"&gt;Apollo Gateway&lt;/a&gt; can combine them all without you having to modify the gateway itself, along with functions like &lt;code&gt;__resolveReference&lt;/code&gt; which resolve an entity from its reference as specified by the directives.&lt;/p&gt;
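
&lt;p&gt;As a rough sketch of how this looks in a subgraph's resolver map: the gateway hands &lt;code&gt;__resolveReference&lt;/code&gt; an entity reference containing the key fields (here just &lt;code&gt;id&lt;/code&gt;), and the function returns the full entity. The user data below is a stand-in for a real data source, and in a real service this map would be combined with the SDL and its federation directives:&lt;/p&gt;

```javascript
// Data a real subgraph would fetch from its own database.
const users = {
  "1": { id: "1", name: "Ada Lovelace", email: "ada@example.com" },
};

const resolvers = {
  User: {
    // Called by the gateway when another subgraph refers to a User by its key.
    __resolveReference(reference) {
      return users[reference.id] ?? null;
    },
  },
};
```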

&lt;p&gt;&lt;a href="https://github.com/apollographql/apollo-tooling"&gt;The Apollo CLI&lt;/a&gt;, when combined with Federation, comes with a lot of helpers to take care of things like pushing the schema, listing the services in the studio, doing codegen and so on, though I am not currently sure why they are &lt;a href="https://github.com/apollographql/rover"&gt;rewriting it in Rust&lt;/a&gt; apart from the reasons suggested &lt;a href="https://jakedawkins.com/2020-03-12-learning-rust/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's quickly look at how Apollo Studio lets you manage the schema.&lt;/p&gt;

&lt;p&gt;This is how you maintain multiple data graphs in your organization across environments:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--32auUuFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kzulzxejlvsb5ds8jm93.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--32auUuFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kzulzxejlvsb5ds8jm93.PNG" alt="View Datagraph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Browse through the schema, its types, documentation and so on&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wGdpjBeI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/td4htfka20pg0n7m54sa.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wGdpjBeI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/td4htfka20pg0n7m54sa.PNG" alt="Browse Schema"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Track the changelog of your schema over time&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qv8FW8KJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/adh95joe4mwwbat36j1h.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qv8FW8KJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/adh95joe4mwwbat36j1h.PNG" alt="Schema changelog"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Browse through the SDL of your schema&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nxHDBmHh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mljfhnlwd9tw0ydlcuv1.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nxHDBmHh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mljfhnlwd9tw0ydlcuv1.PNG" alt="Browse SDL"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Execute GraphQL operations against your schema&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---2pPKrHZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jgq8254jkcitmbsiruqq.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---2pPKrHZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jgq8254jkcitmbsiruqq.PNG" alt="Run GraphQL Operations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it offers a lot more, especially when you are a paying customer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; &lt;a href="https://github.com/apollographql/apollo-server/issues/2360"&gt;Federation with Apollo Server does not support subscriptions yet&lt;/a&gt; and you might want to stick with stitching if you are looking for subscriptions support or switch to some other server like Mercurius &lt;a href="https://github.com/mercurius-js/mercurius/issues/268"&gt;since it does allow subscriptions over federation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.gatsbyjs.com/"&gt;Gatsby&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gatsby is a static site generator powered by React, GraphQL and a lot of plugins contributed by the community. It helps you build sites by pulling in data from multiple different sources in multiple different ways, and it really popularized the idea of doing all this via GraphQL. If you want to know why and how Gatsby uses GraphQL, you can &lt;a href="https://www.gatsbyjs.com/docs/why-gatsby-uses-graphql/"&gt;give this a read&lt;/a&gt;. And while Gatsby offers both &lt;a href="https://www.gatsbyjs.com/docs/glossary/server-side-rendering/"&gt;Server Side Rendering&lt;/a&gt; and &lt;a href="https://www.gatsbyjs.com/docs/glossary/static-site-generator/"&gt;Static Site Generation&lt;/a&gt;, I would say it all boils down to your &lt;a href="https://frontarm.com/james-k-nelson/static-vs-server-rendering/"&gt;use case&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While Gatsby did popularize the idea of using GraphQL for static sites, there are a lot of other static site generators out there like &lt;a href="https://www.11ty.dev/"&gt;Eleventy&lt;/a&gt;, &lt;a href="https://github.com/jekyll/jekyll"&gt;Jekyll&lt;/a&gt; and &lt;a href="https://gohugo.io/"&gt;Hugo&lt;/a&gt;, and I personally find myself aligning towards Eleventy for quite a few reasons that are out of scope for this blog. But if you are curious, you can read posts like &lt;a href="https://mtm.dev/static"&gt;this&lt;/a&gt; and &lt;a href="https://snipcart.com/blog/choose-best-static-site-generator"&gt;this&lt;/a&gt; which give a comparison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/open-telemetry/opentelemetry-js-contrib/tree/master/plugins/node/opentelemetry-instrumentation-graphql"&gt;Opentelemetry - GraphQL&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Opentelemetry is the new standard for instrumentation (especially after &lt;a href="https://medium.com/opentracing/a-roadmap-to-convergence-b074e5815289"&gt;OpenTracing and OpenCensus merged&lt;/a&gt;), and this makes things really amazing, since there was quite a bit of overlap between the two before, which can now be avoided, bringing about a single powerful tracing standard.&lt;/p&gt;

&lt;p&gt;Opentelemetry is not specific to any language or implementation, and you can find all the amazing projects from OpenTelemetry hosted &lt;a href="https://github.com/open-telemetry"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, the exciting thing is that there is a reference implementation for GraphQL, which you can find &lt;a href="https://github.com/open-telemetry/opentelemetry-js-contrib/tree/master/plugins/node/opentelemetry-instrumentation-graphql"&gt;here&lt;/a&gt;, along with an example to help you out &lt;a href="https://github.com/open-telemetry/opentelemetry-js-contrib/tree/master/examples/graphql"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This, when used with &lt;a href="https://www.jaegertracing.io/"&gt;Jaeger&lt;/a&gt;, &lt;a href="https://zipkin.io/"&gt;Zipkin&lt;/a&gt; or &lt;a href="https://grafana.com/oss/tempo/"&gt;Tempo&lt;/a&gt;, can provide you with traces for your GraphQL operations, tracked across your resolvers. Do note that it is not advisable to turn it on for everything, since it has a performance overhead.&lt;/p&gt;

&lt;p&gt;This can give you visibility into how your data and context flow through your resolvers and functions, irrespective of your architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/APIs-guru/graphql-faker"&gt;GraphQL Faker&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Marak/faker.js"&gt;Faker.js&lt;/a&gt; has been a great project for quickly generating mock or sample data, with various types of entities built in. For example, you can generate random addresses, images, URLs and so on, helping you quickly test your application without relying on the server or the backend to hold data.&lt;/p&gt;

&lt;p&gt;This has become even more amazing with GraphQL Faker, since it lets you use all the great things Faker provides via directives. Just specify the relevant directives on a field describing the data you want, and GraphQL Faker will generate it for you using &lt;code&gt;Faker.js&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KvpGsWnk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/szzqo51pgj3szqjck51c.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KvpGsWnk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/szzqo51pgj3szqjck51c.gif" alt="GraphQL Faker"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: GraphQL Faker&lt;/p&gt;

&lt;p&gt;If you are using &lt;code&gt;@graphql-tools&lt;/code&gt; you can also use &lt;code&gt;faker.js&lt;/code&gt; directly and combine it with &lt;a href="https://www.graphql-tools.com/docs/mocking/"&gt;Mocking&lt;/a&gt; to get similar results, but without the need to change your SDL.&lt;/p&gt;

&lt;p&gt;While there are a lot of other tools we could discuss, the GraphQL ecosystem is huge and this list pretty much has no end. But I do believe these are the main tools you need to start your GraphQL journey and leverage the ecosystem in the best way possible.&lt;/p&gt;

&lt;p&gt;But with this, the GraphQL journey is still not over. We will continue in the next blog, discussing a few more interesting things as part of the GraphQL series.&lt;/p&gt;

&lt;p&gt;Is there anything you would like to see me address in this series? Do let me know, and we can probably cover it in another post.&lt;/p&gt;

&lt;p&gt;If you have any questions or are looking for help, feel free to reach out to me &lt;a href="https://twitter.com/techahoy"&gt;@techahoy&lt;/a&gt; anytime.&lt;/p&gt;

&lt;p&gt;And if this helped, do share it with your friends, hang around, and follow us for more like this every week. See you all soon.&lt;/p&gt;

</description>
      <category>graphql</category>
      <category>typescript</category>
      <category>node</category>
      <category>tooling</category>
    </item>
    <item>
      <title>GraphQL - The Stack #2</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Sun, 06 Dec 2020 02:13:27 +0000</pubDate>
      <link>https://dev.to/timecampus/the-stack-2-30me</link>
      <guid>https://dev.to/timecampus/the-stack-2-30me</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is a part of a series on GraphQL where we will dive deep into GraphQL and its ecosystem one piece at a time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/timecampus/the-stack-1-1920"&gt;previous blog&lt;/a&gt;, we started going through "The GraphQL Stack" we use at &lt;a href="https://twitter.com/timecampus"&gt;Timecampus&lt;/a&gt;, covering libraries and tools like &lt;a href="https://code.visualstudio.com/"&gt;VSCode&lt;/a&gt;, &lt;a href="https://graphql-config.com/"&gt;GraphQL Config&lt;/a&gt;, &lt;a href="https://github.com/graphql/vscode-graphql"&gt;VSCode GraphQL&lt;/a&gt;, &lt;a href="https://github.com/dotansimha/graphql-eslint"&gt;GraphQL ESLint&lt;/a&gt;, &lt;a href="https://graphql-inspector.com/"&gt;GraphQL Inspector&lt;/a&gt;, &lt;a href="https://www.typescriptlang.org/"&gt;Typescript&lt;/a&gt;, &lt;a href="https://github.com/contrawork/graphql-helix"&gt;GraphQL Helix&lt;/a&gt; and &lt;a href="https://graphql-code-generator.com/"&gt;GraphQL Codegen&lt;/a&gt;. In this blog, we will continue our journey from where we left off.&lt;/p&gt;

&lt;p&gt;Before we continue, one disclaimer: the GraphQL ecosystem is so huge and fast-growing that it is not feasible to look at everything out there in this series. What we can be sure of is that this will put you a few steps ahead in your journey with GraphQL and its ecosystem. With that, let's start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://graphql-modules.com/"&gt;GraphQL Modules&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we have discussed before, GraphQL acts as a single entry point to all your data, giving you a unified data graph which any client can consume, and that is really powerful. But it does not mean you have to mix all your code up in one place, making it really difficult to manage.&lt;/p&gt;

&lt;p&gt;As people have found, both &lt;a href="https://microservices.io/patterns/microservices.html"&gt;Microservices&lt;/a&gt; and &lt;a href="https://microservices.io/patterns/monolithic.html"&gt;Monolithic&lt;/a&gt; architectures come with their own sets of advantages and challenges, and what you go for depends entirely on your use case, the scale you need, your team and your talent pool.&lt;/p&gt;

&lt;p&gt;But irrespective of the architecture you go for, you should keep your application modular. Having clear responsibilities, separating concerns and decomposing your application into modules gives you great flexibility and power, and makes your application less error prone, because each module does just one thing and does it well.&lt;/p&gt;

&lt;p&gt;Now, this is where GraphQL Modules comes in. Yes, you can have your own way of organizing code, your own way to pull in schemas, your own set of tools and so on, but you don't have to reinvent every wheel there is.&lt;/p&gt;

&lt;p&gt;It helps you decompose your schema, resolvers, types and context into smaller modules, each completely isolated yet able to talk to the others. And it becomes even more powerful as you scale, since it comes with concepts like Dependency Injection, allowing you to specify your own providers, tokens, scope and so on.&lt;/p&gt;
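
&lt;p&gt;To make the idea concrete, here is a small plain-TypeScript sketch of module composition. This is not the actual graphql-modules API (which revolves around &lt;code&gt;createModule&lt;/code&gt; and &lt;code&gt;createApplication&lt;/code&gt;); the names and shapes below are illustrative only:&lt;/p&gt;

```typescript
// Plain-TypeScript sketch of the module-composition idea. The real
// graphql-modules API differs; this only shows how isolated modules,
// each owning its own SDL and resolvers, can be merged into one schema.
type ResolverMap = { [typeName: string]: { [field: string]: () => unknown } };

interface Module {
  id: string;
  typeDefs: string;       // SDL owned by this module
  resolvers: ResolverMap; // resolvers owned by this module
}

const userModule: Module = {
  id: "user",
  typeDefs: "type Query { user: String }",
  resolvers: { Query: { user: () => "vignesh" } },
};

const deviceModule: Module = {
  id: "device",
  typeDefs: "type Query { device: String }",
  resolvers: { Query: { device: () => "laptop" } },
};

// Compose all modules into one schema string and one resolver map,
// which is what the server ultimately consumes. (Real tools also merge
// duplicate type definitions; this sketch just concatenates the SDL.)
function compose(modules: Module[]) {
  const typeDefs = modules.map((m) => m.typeDefs).join("\n");
  const resolvers: ResolverMap = {};
  for (const m of modules) {
    for (const typeName of Object.keys(m.resolvers)) {
      resolvers[typeName] = { ...resolvers[typeName], ...m.resolvers[typeName] };
    }
  }
  return { typeDefs, resolvers };
}

const app = compose([userModule, deviceModule]);
```

&lt;p&gt;Each module stays self-contained and testable on its own, and the gateway only ever sees the composed result.&lt;/p&gt;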

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; GraphQL Modules overrides the execute call from &lt;code&gt;graphql-js&lt;/code&gt; to do all its work. So, make sure that the GraphQL server you use allows you to override it.&lt;/p&gt;

&lt;p&gt;At Timecampus, we use a microservices architecture, and every microservice is essentially a monorepo (&lt;a href="https://pnpm.js.org/en/workspaces"&gt;PNPM Workspaces&lt;/a&gt;) by itself covering a specific domain. For instance, this is what a portion of my directory structure looks like. If you notice, I can split every microservice into multiple modules like this, which lets me manage the code better.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6hFdekw2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1rk9cqi0t8o0tdl774al.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6hFdekw2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1rk9cqi0t8o0tdl774al.png" alt="GraphQL Modules Directory Structure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is what a simple provider looks like. If you notice, it is very simple to comprehend. The convention I use is to group CRUD operations into a single module, though a module need not call for a separate microservice all by itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v_t2QreJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ijjz7vlqqlyunlwgjbmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v_t2QreJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ijjz7vlqqlyunlwgjbmt.png" alt="Sample GraphQL Provider"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And your Mutations become as simple as this, calling the injector, doing the operations and returning the results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KR6v6XUN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5yolw44gqcr55zz6hxjw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KR6v6XUN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5yolw44gqcr55zz6hxjw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally all you have to do is compose the schema and resolvers from all the modules in your server giving a unified GraphQL endpoint you can use.&lt;/p&gt;

&lt;p&gt;Now, this becomes even more powerful if you use the &lt;a href="https://graphql-code-generator.com/docs/presets/graphql-modules"&gt;&lt;strong&gt;GraphQL Modules Preset&lt;/strong&gt;&lt;/a&gt; with Codegen since it essentially also splits your types and generates types for each GraphQL Module making things even more organized and isolated.&lt;/p&gt;

&lt;p&gt;There is a lot more that we can explore, but I will leave it at this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://graphql-mesh.com/"&gt;GraphQL Mesh&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What if you could use GraphQL for all your operations even when your backend systems, data sources and services do not understand GraphQL natively, and without spending time converting them to GraphQL endpoints? And what if you could aggregate and mesh all of them together with GraphQL? This is where GraphQL Mesh comes into the picture.&lt;/p&gt;

&lt;p&gt;GraphQL Mesh acts as an abstraction layer which can interface with multiple different types of backends like REST, SOAP, GraphQL, GRPC, OData, Thrift and even databases like MySQL, Neo4j and so on as documented &lt;a href="https://graphql-mesh.com/docs/handlers/available-handlers/"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;All you need to do is provide a config file, &lt;code&gt;.meshrc.yaml&lt;/code&gt;, and it will generate everything for you; the execution engine takes care of converting your GraphQL queries into native, backend-specific queries.&lt;/p&gt;
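
&lt;p&gt;As a rough sketch, a minimal &lt;code&gt;.meshrc.yaml&lt;/code&gt; wiring up a REST backend through its OpenAPI spec could look like this (the source name and path are purely illustrative):&lt;/p&gt;

```yaml
# Illustrative GraphQL Mesh config: one OpenAPI-backed source
sources:
  - name: PetStore
    handler:
      openapi:
        source: ./petstore-openapi.json
```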

&lt;p&gt;Think of GraphQL Mesh as a universal ORM, not limited to databases but covering any data source or service which produces data and has an execution layer for performing operations on it.&lt;/p&gt;

&lt;p&gt;For example, you can pass in your OpenAPI spec, and GraphQL Mesh will generate everything necessary to provide a GraphQL schema which you can use.&lt;/p&gt;

&lt;p&gt;At first, I had to think a bit about whether GraphQL Mesh was relevant to me, because my stack already uses GraphQL natively throughout (including my data source &lt;a href="http://dgraph.io/"&gt;Dgraph&lt;/a&gt;, which supports GraphQL natively), and hence I was not sure it suited my use case.&lt;/p&gt;

&lt;p&gt;But the more I thought about it, the more I saw GraphQL Mesh as an abstraction layer which makes my stack future-proof, irrespective of the data sources or backends I may add in the future. And the beauty of it is that there are many ways to use the Mesh (as a separate service, as an SDK within your service, or as a gateway).&lt;/p&gt;

&lt;p&gt;I personally use GraphQL Mesh as an SDK within my services to access the backend data sources running GraphQL, thereby avoiding any bottlenecks. The added advantage is that it makes all the operations you do fully typed.&lt;/p&gt;

&lt;p&gt;Since I am just in the initial phases of development, this is what my &lt;code&gt;.meshrc&lt;/code&gt; file looks like, where I interface with Dgraph through GraphQL Mesh:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VYIy0urF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ej0pgh1mdcsl8tuaixko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VYIy0urF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ej0pgh1mdcsl8tuaixko.png" alt="GraphQL Meshrc file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And once I have the SDK generated with GraphQL Mesh, all I have to do is use the methods the SDK provides me (based on the GraphQL mutations and queries I have given it as inputs), like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IR9npNtz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9z17aim9aezlnbei9yb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IR9npNtz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9z17aim9aezlnbei9yb4.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes it really powerful to use without worrying about what happens underneath. While there is a lot more we could say about GraphQL Mesh, I will leave it at this for now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://graphql-tools.com/"&gt;GraphQL Tools&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you talk about GraphQL, one simply cannot forget GraphQL Tools, irrespective of the architecture or stack you use. Initially developed by &lt;a href="https://apollographql.com/"&gt;Apollo&lt;/a&gt; and then taken over by &lt;a href="https://the-guild.dev/"&gt;The Guild&lt;/a&gt;, GraphQL Tools provides a very powerful set of utility functions for working with GraphQL which you can use in your services, irrespective of whether you are using something like &lt;a href="https://www.apollographql.com/docs/federation/"&gt;Apollo Federation&lt;/a&gt; or &lt;a href="https://www.graphql-tools.com/docs/stitch-combining-schemas"&gt;Schema Stitching&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Its utility functions help you load a remote GraphQL schema, merge schemas, mock a schema with test data, stitch schemas together with either type merging or schema extensions, write GraphQL schema directives, and the list goes on.&lt;/p&gt;

&lt;p&gt;And since it is published as scoped packages under &lt;code&gt;@graphql-tools&lt;/code&gt;, you can import only the modules you want and use them without adding any bloat.&lt;/p&gt;

&lt;p&gt;The reason GraphQL Tools shines is that it stops you from reinventing the wheel, helping you focus on the things which really matter most in your journey with GraphQL. For example, as you can see below, I use functions from GraphQL Tools extensively when I do operations with my schema, like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fFxgVMvj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/h186vwkl183o3bo0q4b0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fFxgVMvj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/h186vwkl183o3bo0q4b0.png" alt="GraphQL Tools operations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it also helps me write my own directives like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oSkMbSbZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lcp6n0sepmnhpvatq3vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oSkMbSbZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lcp6n0sepmnhpvatq3vn.png" alt="GraphQL Directives"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And since I have recently moved from Federation to Stitching, I am also starting to use &lt;a href="https://www.graphql-tools.com/docs/stitch-type-merging"&gt;&lt;strong&gt;Typemerging&lt;/strong&gt;&lt;/a&gt; from GraphQL Tools for my GraphQL gateway setup, like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F1ZliSjg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/36qx4gaotewptx4dmhj0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F1ZliSjg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/36qx4gaotewptx4dmhj0.png" alt="GraphQL Gateway"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are new to schema stitching with Typemerging, I would recommend you check out &lt;a href="https://github.com/gmac/schema-stitching-demos/"&gt;&lt;strong&gt;this repository&lt;/strong&gt;&lt;/a&gt; from Greg where he does a great job of explaining all the concepts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/dotansimha/graphql-typed-document-node"&gt;Typed Document Node&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Typed Document Node holds a special place in my heart, because it was only after coming across this project that I started understanding the power of marrying GraphQL and Typescript (I had ignored Codegen and all the related tooling before this, since I did not understand its importance back then).&lt;/p&gt;

&lt;p&gt;Typed Document Node does one simple job: it converts your GraphQL documents into Typescript DocumentNode objects, irrespective of whether the document is a query, mutation, subscription or fragment. You can have Codegen generate all the Typed Document Node types for you as you work.&lt;/p&gt;

&lt;p&gt;And the reason it is really good is that it works well with other libraries like &lt;code&gt;@apollo/client&lt;/code&gt;: you can pass a TypedDocumentNode object generated from your GraphQL operations, and the results will also be fully typed, so you can stop worrying about manually typing your GraphQL requests.&lt;/p&gt;

&lt;p&gt;For example, this is how I use TypedDocumentNode to have all my GraphQL operations typed when calling &lt;code&gt;@apollo/client/core&lt;/code&gt; in my app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SEMIXKg6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jcv6jep54fc8wiz6sn91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SEMIXKg6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jcv6jep54fc8wiz6sn91.png" alt="Typed Document Node Example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All I had to do was pass the document which was generated, and if you notice, even my response is fully typed.&lt;/p&gt;

&lt;p&gt;And this is what the generated Document Nodes look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--36kgfRG4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/fxsjj8e0dzdndjsvtsqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--36kgfRG4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/fxsjj8e0dzdndjsvtsqt.png" alt="Typed Document Node"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Initially I had it running on both the server and the client side, but I then removed it from the server side, since the SDK from GraphQL Mesh was already doing that job for me.&lt;/p&gt;

&lt;p&gt;There are also plugins like &lt;a href="https://graphql-code-generator.com/docs/plugins/typescript-graphql-request"&gt;&lt;strong&gt;TypeScript GraphQL-Request&lt;/strong&gt;&lt;/a&gt; for Codegen which generate an SDK out of your GraphQL operations. While I haven't tried it, I did not opt for it because I did not want to get coupled to the &lt;code&gt;graphql-request&lt;/code&gt; library, and my current setup fits my use case pretty well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dgraph.io/"&gt;Dgraph&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(Watch from 25:43 for my talk on Dgraph)&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/wqG9B7kx2rI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;While Dgraph is not relevant to everyone, and definitely not to legacy systems, it is of real relevance and significance for us as we work on &lt;a href="https://twitter.com/timecampus"&gt;Timecampus&lt;/a&gt;. Dgraph is a scalable and distributed graph database written in &lt;a href="https://golang.org/"&gt;&lt;strong&gt;Golang&lt;/strong&gt;&lt;/a&gt; which understands &lt;strong&gt;GraphQL&lt;/strong&gt; natively (it also has its own query language, &lt;a href="https://dgraph.io/docs/dql/"&gt;&lt;strong&gt;DQL&lt;/strong&gt;&lt;/a&gt;, a modification of the GraphQL spec to support database-specific optimizations).&lt;/p&gt;

&lt;p&gt;As I was building the product, I started off with &lt;a href="https://www.postgresql.org/"&gt;&lt;strong&gt;Postgres&lt;/strong&gt;&lt;/a&gt; with &lt;a href="https://www.prisma.io/"&gt;&lt;strong&gt;Prisma&lt;/strong&gt;&lt;/a&gt; as my ORM. But the more I thought and wrote code, the more I noticed a few things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The entities were increasingly getting connected to each other through various kinds of relationships&lt;/li&gt;
&lt;li&gt;Initially, following microservices conventions, I was paranoid and gave every microservice its own Postgres database instance. This left me with isolated pools of data, forcing a lot of manual cross-service calls whenever I wanted to relate data across databases&lt;/li&gt;
&lt;li&gt;I had to know exactly which database instance held a given schema before even making the call from a service. Hence, the databases were no longer an implementation detail&lt;/li&gt;
&lt;li&gt;Since I was using Prisma with Postgres (and believe me, Prisma was really amazing to work with), I also had to manage things like &lt;a href="https://www.prisma.io/docs/concepts/components/prisma-migrate"&gt;&lt;strong&gt;Migrations&lt;/strong&gt;&lt;/a&gt;, rolling them back and forth, and do this in the CI/CD pipelines as well, which added more complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, there were a lot of other challenges beyond these, but a few things I quickly realized were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Almost all the data was connected in some way or another (or at least the majority was)&lt;/li&gt;
&lt;li&gt;Splitting the database into multiple isolated instances per microservice was just adding more and more complexity, and in my view the effort was not worth it&lt;/li&gt;
&lt;li&gt;A database like Postgres (or others like MySQL, MSSQL) was not originally designed for a microservices-like architecture (while it definitely works with one). This makes things like horizontal scaling across multiple nodes difficult to do (while definitely possible with workarounds)&lt;/li&gt;
&lt;li&gt;Also, since I run my entire stack on Kubernetes, I was looking for a database with cloud-native support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While I was aware of graph databases before, many of them are meant just for storing the edges and vertices (i.e. the relationships between nodes) and traversing them, and do not support storing the data itself, for which you have to opt for a second database to read/write the data. This adds a lot of complexity, and you have to keep both in sync as well, which is really hard to do.&lt;/p&gt;

&lt;p&gt;Now, Dgraph solves all these problems (and the awesome part, as I already mentioned, is that it supports GraphQL natively, which gives me the ability to use all the GraphQL tools with it).&lt;/p&gt;

&lt;p&gt;While they also offer a hosted solution called &lt;a href="https://dgraph.io/slash-graphql"&gt;&lt;strong&gt;Slash GraphQL&lt;/strong&gt;&lt;/a&gt;, I opted to host Dgraph Open Source on my own, since I wanted to support any environment, be it hybrid cloud or on-premise, and wanted to keep the data as close to me as possible for compliance.&lt;/p&gt;

&lt;p&gt;Since it exposes a GraphQL endpoint, I also run the Mesh SDK/Codegen against it, and it gives me completely typed database operations through the SDK, as mentioned above.&lt;/p&gt;

&lt;p&gt;And the only tool I need to interact with it is a GraphQL client like Insomnia or the VSCode REST Client (while it does expose its own client, called &lt;strong&gt;Ratel&lt;/strong&gt;, for doing DQL operations and managing the database). Moreover, the database schema is nothing but a GraphQL schema, so there was no learning curve either.&lt;/p&gt;
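
&lt;p&gt;As a hedged illustration, a Dgraph schema is plain GraphQL SDL sprinkled with a few Dgraph-specific directives such as &lt;code&gt;@search&lt;/code&gt;, &lt;code&gt;@id&lt;/code&gt; and &lt;code&gt;@hasInverse&lt;/code&gt; (the types and fields below are made up):&lt;/p&gt;

```graphql
# Illustrative Dgraph schema: regular SDL plus Dgraph directives
type Task {
  id: ID!
  title: String! @search(by: [term])
  completed: Boolean! @search
  user: User @hasInverse(field: tasks)
}

type User {
  username: String! @id
  tasks: [Task]
}
```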

&lt;p&gt;Another beautiful thing I liked about it is that I need not worry about scalability anymore: Dgraph can be distributed horizontally across multiple nodes or containers in my Kubernetes cluster, scaled up and down, and still handle everything behind a single GraphQL endpoint, without me having to set up a database per microservice.&lt;/p&gt;

&lt;p&gt;A single graph database instance per microservice did not make sense to me, since it would effectively split the graph into multiple pieces, and the whole point of having a completely connected data graph would be lost.&lt;/p&gt;

&lt;p&gt;Also, the &lt;a href="https://dgraph.io/compare-features"&gt;feature set was quite promising&lt;/a&gt; when comparing other graph databases and the &lt;a href="https://dgraph.io/blog/post/benchmark-neo4j/"&gt;benchmarks were also quite promising&lt;/a&gt; when comparing the likes of Neo4j, but there is definitely a &lt;a href="https://github.com/neo4j/neo4j/issues/8684#issuecomment-274762913"&gt;counter argument for that&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But the reason I find Dgraph more appealing is that its underlying store is &lt;a href="https://github.com/dgraph-io/badger"&gt;Badger&lt;/a&gt;, which is written in Golang and hence comes with its own set of advantages and performance gains. On top of this, &lt;a href="https://dgraph.io/docs/badger/projects-using-badger/"&gt;Dgraph is not the only store which uses Badger&lt;/a&gt;, which makes it even more exciting to use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; I don't have experience running Dgraph in production (since we are on our way to launch), but there are &lt;a href="https://dgraph.io/case-studies"&gt;definitely others who have done it&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, the reason I added Dgraph to this stack is that it offers a great GraphQL-native database solution. But if you are looking to go with Neo4j, it offers a &lt;a href="https://neo4j.com/labs/grandstack-graphql/"&gt;GraphQL adapter&lt;/a&gt; too.&lt;/p&gt;

&lt;p&gt;Well, the discussion doesn't end here; there is a lot more we can talk about with respect to GraphQL and its ecosystem. We will continue in the next blog post. Hope this was insightful.&lt;/p&gt;

&lt;p&gt;If you have any questions or are looking for help, feel free to reach out to me &lt;a href="https://twitter.com/techahoy"&gt;@techahoy&lt;/a&gt; anytime.&lt;/p&gt;

&lt;p&gt;And if this helped, do share it with your friends, hang around, and follow us for more like this every week. See you all soon.&lt;/p&gt;

</description>
      <category>graphql</category>
      <category>typescript</category>
      <category>node</category>
      <category>tooling</category>
    </item>
    <item>
      <title>GraphQL - The Stack #1</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Sun, 06 Dec 2020 02:04:17 +0000</pubDate>
      <link>https://dev.to/timecampus/the-stack-1-1920</link>
      <guid>https://dev.to/timecampus/the-stack-1-1920</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is a part of a series on GraphQL where we will dive deep into GraphQL and its ecosystem one piece at a time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now that we have discussed GraphQL, and also some of the architectural considerations when starting off, let's look at the next important step in the puzzle: choosing the right tech stack for your use case and building the development workflow which suits you best.&lt;/p&gt;

&lt;p&gt;Technology changes and evolves constantly, as we have seen all these days. So, rather than worrying too much about the technology you choose, it is better to choose a tool, library or platform which allows for incremental changes without lock-in. The list in the &lt;a href="https://dev.to/timecampus/graphql-usecase-and-architecture-422p"&gt;&lt;strong&gt;previous blog post&lt;/strong&gt;&lt;/a&gt; might actually help in your decision-making process.&lt;/p&gt;

&lt;p&gt;But today I am going to assume a tech stack (the GraphQL tech stack I work with every day to build &lt;a href="https://www.twitter.com/timecampus"&gt;Timecampus&lt;/a&gt;) and walk you through it. The reason I say &lt;strong&gt;“GraphQL” Tech Stack&lt;/strong&gt; is that this is just a part of the complete stack I use; there is more to it, which we will discuss sometime down the line in a different blog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; While these work great for me, this is an area of continuous exploration for me, and I don't mind replacing X with Y as long as the effort is really worth it from a future perspective (we will explore more of what they are and why we use them as we go along). With that, let's start.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://code.visualstudio.com/"&gt;&lt;strong&gt;VSCode&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is no doubt that VSCode has become the de facto editor developers use these days, and it definitely deserves the recognition and credit it gets. VSCode comes with amazing extensions and tooling for GraphQL and its ecosystem built by the community, and if you work with GraphQL and Typescript, it is pretty much the standard editor you would want to use.&lt;/p&gt;

&lt;p&gt;For instance, just do a search for “GraphQL” in the marketplace, and this is what you get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZRL07bsl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/s5k5honjlal2gh16zbxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZRL07bsl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/s5k5honjlal2gh16zbxg.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and the ecosystem is growing even more every day, which makes VSCode indispensable for our stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://graphql-config.com/"&gt;&lt;strong&gt;GraphQL Config&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GraphQL Config acts as a single point of configuration for everything we do with GraphQL. This matters because, when working on projects, it is important to have little to no repetition (the DRY principle), and a separate config file for every tool gets overwhelming and messy over time, since there are multiple places to maintain.&lt;/p&gt;

&lt;p&gt;We can specify everything GraphQL-related in a single &lt;code&gt;.graphqlrc&lt;/code&gt; file, as mentioned in the docs, from the location of the schema and the GraphQL documents (queries and mutations) to the configuration for the extensions we use with it.&lt;/p&gt;

&lt;p&gt;Not just that: a single &lt;code&gt;.graphqlrc&lt;/code&gt; file can hold all the configuration you need for multiple projects in your workspace.&lt;/p&gt;

&lt;p&gt;For example, it integrates with the VSCode GraphQL extension to provide autocompletion, intellisense and so on, provides all the config needed for code generation with GraphQL Codegen and linting with GraphQL ESLint, and paves the way for any other tools we may integrate in the future.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;.graphqlrc.yml&lt;/code&gt; file may look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P1T4fpW1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/92zj64h5kxe26nf149h8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P1T4fpW1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/92zj64h5kxe26nf149h8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GraphQL Config Snippet&lt;/p&gt;
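
&lt;p&gt;In text form, a minimal &lt;code&gt;.graphqlrc.yml&lt;/code&gt; along these lines might look as follows (the paths and the codegen plugin selection are illustrative, not a prescription):&lt;/p&gt;

```yaml
# Illustrative graphql-config file shared by the editor, ESLint and Codegen
schema: ./src/schema/**/*.graphql
documents: ./src/operations/**/*.graphql
extensions:
  codegen:
    generates:
      ./src/generated/types.ts:
        plugins:
          - typescript
          - typescript-operations
```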

&lt;p&gt;&lt;a href="https://github.com/graphql/vscode-graphql"&gt;&lt;strong&gt;VSCode GraphQL&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next thing which comes to mind is a VSCode extension which supports everything you need to do with GraphQL. Originally developed by the amazing people at &lt;a href="https://www.prisma.io/"&gt;&lt;strong&gt;Prisma&lt;/strong&gt;&lt;/a&gt;, this extension was later donated to the &lt;a href="https://foundation.graphql.org/"&gt;&lt;strong&gt;GraphQL Foundation&lt;/strong&gt;&lt;/a&gt;. The reason it is really promising is that it provides everything you need to work with GraphQL, including syntax highlighting, autocompletion, validation, SDL navigation, operation execution and support for tagged template literals, all with support for GraphQL Config, and it works great.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; If you are using the Apollo Stack (like Federation), I would recommend you to go with &lt;a href="https://www.apollographql.com/docs/devtools/editor-plugins/"&gt;&lt;strong&gt;Apollo VSCode&lt;/strong&gt;&lt;/a&gt; instead since it provides support for things like &lt;code&gt;apollo.config.js&lt;/code&gt; (which integrates with the schema registry), federation directives and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dotansimha/graphql-eslint"&gt;&lt;strong&gt;GraphQL ESLint&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next thing which is important when you work with GraphQL as a team is following a set of standards so that everyone is on the same page. This is where a linter like GraphQL ESLint really helps. The beauty is that it integrates seamlessly with GraphQL Config, supports ESLint natively and also provides some &lt;a href="https://github.com/dotansimha/graphql-eslint/blob/master/docs/README.md"&gt;&lt;strong&gt;inbuilt rules&lt;/strong&gt;&lt;/a&gt; which are a great starting point, like enforcing consistent casing, making the naming of operations mandatory and requiring a deprecation reason, all of which can be of great use as you scale up with GraphQL.&lt;/p&gt;

&lt;p&gt;A sample &lt;code&gt;.eslintrc&lt;/code&gt; file to be used for GraphQL ESLint would look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--afBgtmch--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/k9g8ilodbabai0rrlkd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--afBgtmch--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/k9g8ilodbabai0rrlkd3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GraphQL ESLint snippet&lt;/p&gt;
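
&lt;p&gt;In text form, a minimal config might look like the sketch below (the exact rule selection here is illustrative; the inbuilt rules docs list everything available):&lt;/p&gt;

```json
{
  "overrides": [
    {
      "files": ["*.graphql"],
      "parser": "@graphql-eslint/eslint-plugin",
      "plugins": ["@graphql-eslint"],
      "rules": {
        "@graphql-eslint/naming-convention": "error",
        "@graphql-eslint/require-deprecation-reason": "error"
      }
    }
  ]
}
```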

&lt;p&gt;&lt;a href="https://graphql-inspector.com/"&gt;&lt;strong&gt;GraphQL Inspector&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How do you make collaborating with GraphQL easy? And how do you do it in such a way that you have all the information you need to take a specific action? What if there are breaking changes to your schema? Errors and issues may creep in anywhere and at any time.&lt;/p&gt;

&lt;p&gt;This is where GraphQL Inspector comes in. It provides a platform with various functionalities like schema validation, coverage, finding similar operations, inspecting the difference between versions of the schema and mocking your schema with test data, along with a GitHub application which does all of this for you when you raise a pull request.&lt;/p&gt;

&lt;p&gt;For example, this is what finding the coverage of your operations against the schema looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uw4OmG20--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/sj06b4nuvz5ufdn0e6cy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uw4OmG20--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/sj06b4nuvz5ufdn0e6cy.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GraphQL Coverage&lt;/p&gt;

&lt;p&gt;And if you want to find similar fields/types within your schema, this is what it looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--owVorOnG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/29xnkze9vwbbzlftcvmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--owVorOnG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/29xnkze9vwbbzlftcvmw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GraphQL Similarity&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.typescriptlang.org/"&gt;&lt;strong&gt;Typescript&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I initially started off with TypeScript a few years ago, I was not sure of the advantages it would provide over time for the effort I was putting in to make the code I write completely typed. To be honest, it takes a lot of effort and can sometimes be painful. But this perception changed over time, especially when I started working with GraphQL and TypeScript together.&lt;/p&gt;

&lt;p&gt;The reason GraphQL works so well with TypeScript is the similarity between them: both are strongly typed, provide a clear path to documentation and offer great validations, and a great ecosystem has been built on top of both.&lt;/p&gt;

&lt;p&gt;This will become more evident as we go through this blog. But writing the types manually for each and every field in the schema, or for every operation, and keeping them updated can be a huge task. This is where a lot of amazing tools come in, like GraphQL Codegen, Typed Document Node, TypeGraphQL and so on.&lt;/p&gt;

&lt;p&gt;And on top of this, the beauty is that with GraphQL and TypeScript we can make the end-to-end stack fully typed (which is what we do at &lt;a href="https://www.twitter.com/timecampus"&gt;Timecampus&lt;/a&gt;). Having seen all this happen, even &lt;code&gt;graphql-js&lt;/code&gt; is on its &lt;a href="https://github.com/graphql/graphql-js/issues/2104"&gt;path to migrating to TypeScript&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/contrawork/graphql-helix"&gt;&lt;strong&gt;Graphql Helix&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are a lot of GraphQL servers out there, and we even spoke about some of them in our &lt;a href="https://dev.to/blog/graphql-deep-dive-1"&gt;&lt;strong&gt;first blog post&lt;/strong&gt;&lt;/a&gt;. While it is not necessary to pick an out-of-the-box GraphQL server (you can always build your own using &lt;code&gt;graphql-js&lt;/code&gt;), rolling your own is often not a smart choice, since you would be reinventing the wheel.&lt;/p&gt;

&lt;p&gt;This is where I use GraphQL Helix, which gives me a GraphQL server along with the option to selectively replace any module I need to suit my usecase. This is very evident from the &lt;a href="https://github.com/contrawork/graphql-helix/tree/master/examples"&gt;&lt;strong&gt;examples folder&lt;/strong&gt;&lt;/a&gt; of the repository, demonstrating various usecases like subscriptions, CSP, graphql-modules, persisted queries and so on, with various frameworks like Express, Fastify and Koa.&lt;/p&gt;

&lt;p&gt;And since there are no outside dependencies except for &lt;code&gt;graphql-js&lt;/code&gt;, there is also none of the bloat found in other GraphQL servers. If you want to see how other GraphQL servers perform, you might want to have a look at &lt;a href="https://github.com/benawad/node-graphql-benchmarks"&gt;this benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://graphql-code-generator.com/"&gt;&lt;strong&gt;GraphQL Codegen&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We did discuss how TypeScript and GraphQL work seamlessly with each other. But what if we could generate everything we can from our SDL, which already provides the majority of the information one needs, including the names of the schema types, their fields and so on?&lt;/p&gt;

&lt;p&gt;And this is where GraphQL Codegen plays a major role. You can generate all the types, interfaces and so on, and it also comes with a lot of &lt;a href="https://graphql-code-generator.com/docs/plugins/index"&gt;&lt;strong&gt;plugins&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://graphql-code-generator.com/docs/presets/presets-index"&gt;&lt;strong&gt;presets&lt;/strong&gt;&lt;/a&gt; to help you work not just with TypeScript but with other languages and tooling too. All we have to do is import the type we need and use it, and every time we change the schema we can simply regenerate the types. It also integrates seamlessly with GraphQL Config, making it really easy to maintain.&lt;/p&gt;

&lt;p&gt;For example, this is what the generated types look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jn7mPgpc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ehgt078somf9ts6ye6pp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jn7mPgpc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ehgt078somf9ts6ye6pp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are more tools, libraries and platforms we have to talk about as part of our GraphQL Stack and we will be continuing our discussion in the next blog post. Hope this was insightful.&lt;/p&gt;

&lt;p&gt;If you have any questions or are looking for help, feel free to reach out to me &lt;a href="https://twitter.com/techahoy"&gt;@techahoy&lt;/a&gt; anytime.&lt;/p&gt;

&lt;p&gt;And if this helped, do share it with your friends, hang around and follow us for more like this every week. See you all soon.&lt;/p&gt;

</description>
      <category>graphql</category>
      <category>typescript</category>
      <category>node</category>
      <category>tooling</category>
    </item>
    <item>
      <title>GraphQL - Usecase and Architecture</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Sun, 06 Dec 2020 01:56:14 +0000</pubDate>
      <link>https://dev.to/timecampus/graphql-usecase-and-architecture-422p</link>
      <guid>https://dev.to/timecampus/graphql-usecase-and-architecture-422p</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is a part of a series on GraphQL where we will dive deep into GraphQL and its ecosystem one piece at a time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/timecampus/graphql-diving-deep-4hnm"&gt;&lt;strong&gt;last blog post&lt;/strong&gt;&lt;/a&gt;, we explored the various questions one might have when starting off or working with the GraphQL ecosystem and answered them. Now that justice has been done to clear the clouded thoughts you might have, let’s dive into the next important step in this blog.&lt;/p&gt;

&lt;p&gt;In this blog, we will start looking at what your architecture can look like when working with GraphQL and its ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your architecture hugely revolves around your usecase, and you have to be very careful to get it right, consulting experts if needed. While it is very important to get it right before you start, mistakes can happen, and with a lot of research happening these days, a revolution can arrive any day and make your old way of thinking obsolete.&lt;/p&gt;

&lt;p&gt;That is why I would highly recommend that you &lt;strong&gt;Architect for Change&lt;/strong&gt; and make your architecture as &lt;strong&gt;Modular&lt;/strong&gt; as possible, so that you have the flexibility to make incremental changes in the future if needed. Let’s just talk about architecture in the context of GraphQL here. We will explore the rest of the architecture more deeply in another blog post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Basics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are some things you would have to think of before starting your journey.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Am I building a monolith or am I working on microservices?
Remember that monoliths still have a huge place in today’s world, given the complexity which comes with microservices, as long as your project is small.&lt;/li&gt;
&lt;li&gt;What is my deployment target going to look like? VMs, containers or bare metal?&lt;/li&gt;
&lt;li&gt;What is going to be my orchestration layer? Kubernetes, Mesos, Swarm or OpenStack?&lt;/li&gt;
&lt;li&gt;What are my scaling needs?&lt;/li&gt;
&lt;li&gt;What is the performance that I expect?&lt;/li&gt;
&lt;li&gt;Do I need Offline support?&lt;/li&gt;
&lt;li&gt;Cloud or On-Premise?&lt;/li&gt;
&lt;li&gt;What is the programming language which makes sense for my usecase?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This list is incomplete. There are more questions like these which you might want to answer for yourself, and answering them can give you a lot of clarity as you start building your architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Ingress / Load Balancer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the first layer that any client would typically hit before making requests to your GraphQL service. This acts as the single entry point for all traffic (it can be regional as well depending on your usecase).&lt;/p&gt;

&lt;p&gt;This would be the first thing you have to set up before getting started, and this is also the layer which handles things like SSL termination, caching (in case you have a CDN set up) and so on.&lt;/p&gt;

&lt;p&gt;If you are in the Kubernetes world, you also have a lot of ingress controllers like &lt;a href="https://kubernetes.github.io/ingress-nginx/"&gt;&lt;strong&gt;Nginx Ingress&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://www.getambassador.io/"&gt;&lt;strong&gt;Ambassador&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://konghq.com/"&gt;&lt;strong&gt;Kong&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://projectcontour.io/"&gt;&lt;strong&gt;Contour&lt;/strong&gt;&lt;/a&gt; and so on which can help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The API Gateway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next comes the entry point for all your GraphQL requests. Since GraphQL exposes a single endpoint, e.g. &lt;code&gt;/graphql&lt;/code&gt;, this becomes the single entry point for all your operations.&lt;/p&gt;

&lt;p&gt;But I would highly recommend against exposing your service directly to clients, since that can be insecure and makes things like rate limiting and load balancing difficult to manage.&lt;/p&gt;

&lt;p&gt;Rather, it is always recommended to expose it via an API gateway of your choice, be it Ambassador, Kong, WSO2, Apigee or anything else for that matter. This can also act as a sort of kill switch and can be used for filtering and moderating traffic whenever needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The GraphQL Gateway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As you evolve, you might end up having multiple services, or might even move to the microservices world to enable scale. This means multiple services, each with its own GraphQL schema, logic and so on.&lt;/p&gt;

&lt;p&gt;But unlike REST, GraphQL exposes a single endpoint irrespective of the underlying services. This is where a gateway plays a major role, and it comes in at the next layer of our architecture. It takes on the role of orchestrating or composing (the two are different) multiple services and schemas together, delegating queries and mutations to the respective microservices, all without the client having to worry about the complexity underneath.&lt;/p&gt;

&lt;p&gt;While you may choose to go for different architectures like &lt;a href="https://graphql-tools.com/docs/schema-stitching"&gt;&lt;strong&gt;Schema Stitching&lt;/strong&gt;&lt;/a&gt; or &lt;a href="https://www.apollographql.com/docs/federation/"&gt;&lt;strong&gt;Federation&lt;/strong&gt;&lt;/a&gt; depending on your usecase, do remember that sometimes this may be overkill. You might not even need a GraphQL gateway to start with if you are building something small, and skipping it can remove a lot of complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The GraphQL Service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next thing to think of would be the GraphQL service itself (be it a monolith or microservice). Each service would be responsible for a part of the complete data graph as seen in &lt;a href="https://principledgraphql.com/integrity#2-federated-implementation"&gt;&lt;strong&gt;Federated Implementation&lt;/strong&gt;&lt;/a&gt; and this will make things easier to scale. Note that the way you implement it can be different as discussed (Schema Stitching or Federation).&lt;/p&gt;

&lt;p&gt;You might also want to modularize your project structure and code within the service. This applies irrespective of whether you use a monolith or microservices: maintain a clear separation of concerns and make everything as composable and modular as possible.&lt;/p&gt;

&lt;p&gt;While you can discover your own way to do it (I initially went down this path), what is the use of reinventing the wheel when you have something like &lt;a href="https://graphql-modules.com/"&gt;&lt;strong&gt;GraphQL Modules&lt;/strong&gt;&lt;/a&gt; which can help you with this?&lt;/p&gt;

&lt;p&gt;You might also want to get your tooling right to reduce the work you do as much as possible, be it linting and validation, code generation or testing, so that you automate most of your workflow and stay productive while working on any part of the service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Mode of Communication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that you have thought about the service(s), you might also want to think about the mode of communication between them, which is essential for passing data back and forth, synchronously and asynchronously. This also presents some questions which you might want to answer before starting.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP (&lt;a href="https://tools.ietf.org/html/rfc2616"&gt;&lt;strong&gt;1.1&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://tools.ietf.org/html/rfc7540"&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/a&gt; or &lt;a href="https://tools.ietf.org/html/draft-ietf-quic-http-32"&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/a&gt;), &lt;a href="https://grpc.io/"&gt;&lt;strong&gt;gRPC&lt;/strong&gt;&lt;/a&gt; (over HTTP/2), &lt;a href="https://thrift.apache.org/"&gt;&lt;strong&gt;Thrift&lt;/strong&gt;&lt;/a&gt; or &lt;a href="https://tools.ietf.org/html/rfc6455"&gt;&lt;strong&gt;WebSockets&lt;/strong&gt;&lt;/a&gt;?&lt;/li&gt;
&lt;li&gt;Do you need a &lt;a href="https://smi-spec.io/"&gt;&lt;strong&gt;Service Mesh&lt;/strong&gt;&lt;/a&gt;?&lt;/li&gt;
&lt;li&gt;Is GraphQL going to be used for communicating between services?&lt;/li&gt;
&lt;li&gt;Do I need something like &lt;a href="https://linkerd.io/2/features/automatic-mtls/"&gt;&lt;strong&gt;MTLS&lt;/strong&gt;&lt;/a&gt; for securing inter-service communication?&lt;/li&gt;
&lt;li&gt;How do I do asynchronous communication? Do I use event queues like &lt;a href="https://kafka.apache.org/"&gt;&lt;strong&gt;Kafka&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://www.rabbitmq.com/"&gt;&lt;strong&gt;RabbitMQ&lt;/strong&gt;&lt;/a&gt; or &lt;a href="https://nats.io/"&gt;&lt;strong&gt;NATS&lt;/strong&gt;&lt;/a&gt;?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again, all of these depend on your usecase, and hence there is no definite answer. But try to go for a protocol which offers low latency and great compatibility, with built-in support for things like compression, encryption and so on.&lt;/p&gt;

&lt;p&gt;This matters because, while all the clients communicate with the single GraphQL endpoint you expose, you still have to have some efficient way to do inter-service communication.&lt;/p&gt;

&lt;p&gt;Even if you are going to communicate between your services with GraphQL (which is what I do), you still have to decide how you transmit the GraphQL queries and mutations between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication &amp;amp; Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Like we discussed in the &lt;a href="https://dev.to/timecampus/graphql-diving-deep-4hnm"&gt;&lt;strong&gt;previous blog post&lt;/strong&gt;&lt;/a&gt;, there are various ways to do authentication and authorization. You might want to consider them while architecting as well, because they decide how chatty your services will be when doing operations, how secure they will be, and so on. There are various approaches, both stateful and stateless. While stateless is better for scalability, you might want to choose what works best for you.&lt;/p&gt;

&lt;p&gt;Depending on your usecase, you might also want to decide if you need something like persisted queries or not. This can prevent clients from sending queries which are not authorized, prevent huge amounts of GraphQL data from being passed over the wire, and so on.&lt;/p&gt;
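
&lt;p&gt;The idea behind persisted queries can be sketched roughly like this (a deliberately simplified illustration, not the API of any specific library): the client sends a known id instead of the full query text, and the server resolves it against an allow-list built at deploy time:&lt;/p&gt;

```typescript
// Allow-list mapping query ids (e.g. hashes) to the operations they stand for.
const persistedQueries = new Map([
  ['a1b2c3', '{ me { id name } }'],
]);

// The server only ever executes operations it already knows about,
// rejecting unknown ids outright.
function resolveQuery(queryId: string): string {
  const query = persistedQueries.get(queryId);
  if (!query) {
    throw new Error('Unknown persisted query: ' + queryId);
  }
  return query;
}

console.log(resolveQuery('a1b2c3'));
```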

&lt;p&gt;&lt;strong&gt;The Backend&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And then comes the backend which you are going to use to store/retrieve data from. There are a huge number of options out there and, to be honest, there is no one database which fits all usecases. And they even come with different variants — SQL, NoSQL, Search, Time Series and even Graph Databases. You can refer to &lt;a href="https://db-engines.com/en/ranking"&gt;&lt;strong&gt;DB-Engines&lt;/strong&gt;&lt;/a&gt; for a complete list.&lt;/p&gt;

&lt;p&gt;And you can even put a GraphQL layer or ORM on top of all of them if you want and take the complexity away from the services (eg. with &lt;a href="https://www.prisma.io/"&gt;&lt;strong&gt;Prisma 2&lt;/strong&gt;&lt;/a&gt; or &lt;a href="https://graphql-mesh.com/"&gt;&lt;strong&gt;GraphQL Mesh&lt;/strong&gt;&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;You might also want to look at how you minimize the number of calls you make to the main database. Do you need caching, and have you set it up? Have you addressed the N+1 problem with &lt;a href="https://github.com/graphql/dataloader"&gt;&lt;strong&gt;Dataloader&lt;/strong&gt;&lt;/a&gt;?&lt;/p&gt;
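
&lt;p&gt;The batching idea behind Dataloader can be sketched like this (a deliberately simplified illustration, not Dataloader's real promise-based API): resolvers queue up keys, and the keys are then fetched in a single batched call instead of one database round trip per key:&lt;/p&gt;

```typescript
// Counts how many batched "database" calls we make in total.
let batchCalls = 0;

// Stand-in for a single batched database query such as
// SELECT * FROM users WHERE id IN (...ids)
function batchFetchUsers(ids: string[]) {
  batchCalls = batchCalls + 1; // one round trip for the whole batch
  return ids.map(function (id) {
    return { id: id, name: 'user-' + id };
  });
}

class TinyLoader {
  private queue: string[] = [];

  // Each resolver registers the key it needs instead of fetching immediately.
  load(id: string) {
    this.queue.push(id);
  }

  // One flush fetches every queued key in a single batch call.
  flush() {
    const results = batchFetchUsers(this.queue);
    this.queue = [];
    return results;
  }
}

const loader = new TinyLoader();
loader.load('1'); // queued by one resolver
loader.load('2'); // queued by another resolver
console.log(loader.flush().length, batchCalls); // two users, one batch call
```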

&lt;p&gt;&lt;strong&gt;More Exploration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, there are a lot of other things you might want to have in your architecture like Hybrid Cloud support, CI/CD pipelines, caching and so on. We will probably explore them in future blog posts as we go along.&lt;/p&gt;

&lt;p&gt;Remember to keep your stack as simple as possible; you can set things up incrementally as you go along.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some Tips&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When architecting applications, I try to use the &lt;a href="https://medium.com/timecampus/the-power-of-black-box-a7d8a320e557"&gt;&lt;strong&gt;Black Box model&lt;/strong&gt;&lt;/a&gt; as much as possible. This simplifies a lot of things for me.&lt;/li&gt;
&lt;li&gt;I try to go for the Zero Trust security model, popularized by Google’s &lt;a href="https://cloud.google.com/beyondcorp"&gt;&lt;strong&gt;BeyondCorp&lt;/strong&gt;&lt;/a&gt;, when building my architecture, and while this creates a lot of friction at the start, it makes life a lot better in the future.&lt;/li&gt;
&lt;li&gt;There are some questions I ask based on the principles like &lt;a href="https://en.wikipedia.org/wiki/You_aren't_gonna_need_it"&gt;&lt;strong&gt;YAGNI&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;,&lt;/strong&gt; &lt;a href="https://en.wikipedia.org/wiki/Don't_repeat_yourself"&gt;&lt;strong&gt;DRY&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;,&lt;/strong&gt; &lt;a href="https://en.wikipedia.org/wiki/KISS_principle"&gt;&lt;strong&gt;KISS&lt;/strong&gt;&lt;/a&gt; and they play a huge role in making sure that you don’t overwhelm yourself with things you don’t want to do right now and prioritize things right.&lt;/li&gt;
&lt;li&gt;I try to refer to case studies and see how others are already solving the same problem, which can save a lot of my time and avoids reinventing the wheel. For GraphQL, you may find them &lt;a href="https://www.graphql.com/case-studies/"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deciding the “Right” Stack for “You”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before I pick any tool or technology as part of my tech stack, I do ask a set of questions which help me better judge and make an informed decision on what I want. Probably it might help you too. This applies not just to the GraphQL ecosystem, but anything you choose for that matter.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does this tool/library solve my problem well?&lt;/li&gt;
&lt;li&gt;What is the licensing model? Is it open source? If so, is it MIT/Apache/BSD/GPL?&lt;/li&gt;
&lt;li&gt;Does it have community support, or is it backed by a foundation/enterprise? When was the last commit? How many contributors are there? Does it have a clear path to becoming a contributor?&lt;/li&gt;
&lt;li&gt;How many people use it in production? What are their experiences? At what scale are they using it?&lt;/li&gt;
&lt;li&gt;What do the stats look like? Stars, forks, downloads?&lt;/li&gt;
&lt;li&gt;Is it bloated? Or does it do just one thing well?&lt;/li&gt;
&lt;li&gt;Does it have a clear roadmap for the future? If so, what are the milestones?&lt;/li&gt;
&lt;li&gt;What are the other alternatives? How does it compare to them?&lt;/li&gt;
&lt;li&gt;How is the documentation? Does it have tests? Does it have examples which I can refer to?&lt;/li&gt;
&lt;li&gt;Does it follow standards, and is it free of vendor lock-in?&lt;/li&gt;
&lt;li&gt;Are there any security concerns which this tool or library might create?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the library or tool might not address all of these questions well, what I look for is at least the intent to address them in the near term.&lt;/p&gt;

&lt;p&gt;While most of the things in this blog may not relate to GraphQL itself, these are things you need to keep in mind before starting your journey with GraphQL. In the next blog, I will show you what my GraphQL tech stack looks like as I use it to build &lt;a href="https://www.twitter.com/timecampus"&gt;Timecampus&lt;/a&gt;, and we will dive deeper into each layer of the stack, one piece at a time.&lt;/p&gt;

&lt;p&gt;Hope this was informative. Do let us know how you prefer to architect with GraphQL in the comments below and we will be happy to know more about it.&lt;/p&gt;

&lt;p&gt;If you have any questions or are looking for help, feel free to reach out to me &lt;a href="https://twitter.com/techahoy"&gt;@techahoy&lt;/a&gt; anytime.&lt;/p&gt;

&lt;p&gt;And if this helped, do share it with your friends, hang around and follow us for more like this every week. See you all soon.&lt;/p&gt;

</description>
      <category>graphql</category>
      <category>architecture</category>
      <category>typescript</category>
      <category>node</category>
    </item>
    <item>
      <title>GraphQL - Diving Deep</title>
      <dc:creator>t.v.vignesh</dc:creator>
      <pubDate>Sun, 06 Dec 2020 01:50:56 +0000</pubDate>
      <link>https://dev.to/timecampus/graphql-diving-deep-4hnm</link>
      <guid>https://dev.to/timecampus/graphql-diving-deep-4hnm</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog is a part of a series on GraphQL where we will dive deep into GraphQL and its ecosystem one piece at a time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The GraphQL specification was open sourced in 2015 by Facebook along with some basic implementations with a completely unique approach on how to structure, consume, transmit and process data and data graphs.&lt;/p&gt;

&lt;p&gt;Today, the GraphQL spec and its implementations have been donated by Facebook to the GraphQL Foundation, with an open license for development and governance from the community, and it has been great so far.&lt;br&gt;
And today, the GraphQL Foundation comprises not just Facebook but other organizational members as well.&lt;/p&gt;

&lt;p&gt;It was a moment when a lot of people were convinced of its power, utility and promise, and the rest became history.&lt;/p&gt;

&lt;p&gt;And today, there is a &lt;a href="https://foundation.graphql.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL Foundation&lt;/strong&gt;&lt;/a&gt; which works to ensure that GraphQL and its ecosystem thrive over time, &lt;a href="https://github.com/graphql/graphql-landscape" rel="noopener noreferrer"&gt;&lt;strong&gt;a huge landscape of projects&lt;/strong&gt;&lt;/a&gt; and a huge set of tools like &lt;a href="https://graphql.org/code/" rel="noopener noreferrer"&gt;&lt;strong&gt;this&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://github.com/chentsulin/awesome-graphql" rel="noopener noreferrer"&gt;&lt;strong&gt;this&lt;/strong&gt;&lt;/a&gt;,&lt;br&gt;
and these are just a few examples of how big the ecosystem has grown, with a lot of languages, frameworks and tools supporting it as a first-class citizen, so much so that even some &lt;a href="https://graphql.org/users/" rel="noopener noreferrer"&gt;&lt;strong&gt;huge enterprises&lt;/strong&gt;&lt;/a&gt; are using it today as part of their stack.&lt;/p&gt;

&lt;p&gt;GraphQL is at the heart of everything we do at &lt;a href="https://www.twitter.com/timecampus" rel="noopener noreferrer"&gt;Timecampus&lt;/a&gt;, and we wanted to share the love we have for GraphQL and its ecosystem, along with the hard lessons we learnt along the way. And it’s not just GraphQL: we will be diving deep into a lot of open source tools, libraries, frameworks, software and practices as we go along.&lt;/p&gt;

&lt;p&gt;I am pretty sure that we have a lot to talk about as we go along. So, why not start the series with an FAQ? That’s what we are going to do here. I have put together a set of questions and answered them below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are new to GraphQL, I would recommend you to start with these links before jumping into this blog post:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://graphql.org/learn/" rel="noopener noreferrer"&gt;Introduction to GraphQL - Learn about GraphQL, how it works, and how to use it&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.howtographql.com/" rel="noopener noreferrer"&gt;How to GraphQL - The Fullstack Tutorial for GraphQLThe free and open-source tutorial to learn all around GraphQL to go from zero to production&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.graphql.com/tutorials/" rel="noopener noreferrer"&gt;Explore GraphQL - This is your GraphQL study guide. Learn the fundamentals of schemas and queries, then implement some apps&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hasura.io/learn/graphql/intro-graphql/introduction/" rel="noopener noreferrer"&gt;GraphQL Tutorial - GraphQL is becoming the new way to use APIs in modern web and mobile apps. However, learning new things always takes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.apollographql.com/blog/the-concepts-of-graphql-bc68bd819be3/" rel="noopener noreferrer"&gt;GraphQL Concepts Visualized - GraphQL is often explained as a "unified interface to access data from different sources"&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if you are keen to dig deep into the GraphQL Spec, it is hosted &lt;a href="http://spec.graphql.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, assuming you already know the basics of GraphQL, let’s jump right in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why should I move away from REST to GraphQL? What are the benefits?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I would start by saying that GraphQL does not make REST or any other channel of communication obsolete. It all boils down to your usecase. For small projects, the simplicity of REST might outweigh the advantages provided by GraphQL, but as you have more teams, an evolving product, complex lifecycles and a data schema which gets bigger by the day, that is when you will truly realize the value GraphQL has to offer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frsm2wi7lj1lczjmd56zb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frsm2wi7lj1lczjmd56zb.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Credits: &lt;a href="https://www.howtographql.com/" rel="noopener noreferrer"&gt;howtographql&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In REST we structure different sets of endpoints for different data paths, and if you look at the &lt;a href="https://www.w3.org/2001/sw/wiki/REST" rel="noopener noreferrer"&gt;REST specification&lt;/a&gt;, it does not offer a way to select only the data you want (leading to over-fetching or under-fetching), does not offer type checking, and has no way to do introspection (unless you build OpenAPI-based documentation yourself). It can also quickly become chatty, since you end up calling different endpoints from the client to get the different sets of data the application needs. GraphQL solves all of these like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdymv1wnferoes8gytv13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdymv1wnferoes8gytv13.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Credits: &lt;a href="https://www.howtographql.com/" rel="noopener noreferrer"&gt;howtographql&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is the beauty of it. It has a strong type system, you select just what you want (avoiding over-fetching/under-fetching), you only talk to a single endpoint, the spec clearly defines how queries are executed (serial or parallel resolvers), and it is protocol-independent: unlike REST, which relies on &lt;strong&gt;HTTP&lt;/strong&gt; to do everything, you can transmit your GQL queries through HTTP, gRPC, WebSockets — you name it.&lt;/p&gt;
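&lt;p&gt;To make that concrete, here is what a client request looks like: one POST to a single endpoint, asking for exactly the fields it needs (the schema and field names here are hypothetical):&lt;/p&gt;

```graphql
# One request, one endpoint, only the fields the client needs:
# no over-fetching, no under-fetching.
query UserWithPosts {
  user(id: "42") {
    name
    posts(last: 3) {
      title
    }
  }
}
```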

&lt;p&gt;&lt;strong&gt;What is the difference between HTTP, GRPC, GraphQL and others?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In summary, all of them are different. HTTP is a protocol by itself and does not define the structure of the data transmitted over it (the latest version is HTTP/3). &lt;a href="https://grpc.io/" rel="noopener noreferrer"&gt;GRPC &lt;/a&gt;uses Protocol Buffers to send packets over HTTP/2 (and in the future can extend to HTTP/3 as well) and is often used for inter-service communication. And GraphQL has nothing to do with the transport layer at all: it is just a specification for structuring and transmitting data between different locations, and it does not even matter if you compress, encrypt or otherwise transform the queries and mutations, as long as you have logic to decompress or decrypt them on the server side. So, in summary, they serve different purposes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I version my GraphQL endpoints like I do in REST?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While there is nothing stopping you from having different versions of your GraphQL endpoints like &lt;code&gt;/v1/graphql&lt;/code&gt; &lt;code&gt;/v2/graphql&lt;/code&gt; or something along those lines, GraphQL recommends a continuously evolving version of your data graph. You can deprecate fields you no longer use (removing them at a later point in time) and add new fields as and when you need them, without affecting the rest of the schema and avoiding conflicts which may occur otherwise.&lt;/p&gt;
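&lt;p&gt;In SDL, this evolution is as simple as marking the old field with the built-in &lt;code&gt;@deprecated&lt;/code&gt; directive (an illustrative schema):&lt;/p&gt;

```graphql
type User {
  id: ID!
  # Old clients keep working while new clients are steered to `name`.
  fullName: String @deprecated(reason: "Use `name` instead.")
  name: String!
}
```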

&lt;p&gt;&lt;strong&gt;What is the recommended way to define my schema?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Over time, people have developed so many abstractions on top of GraphQL that there suddenly seem to be a lot of ways to define the schema.&lt;/p&gt;

&lt;p&gt;Some of these include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing the SDL directly as &lt;code&gt;.gql&lt;/code&gt; or &lt;code&gt;.graphql&lt;/code&gt; files and then loading and parsing them&lt;/li&gt;
&lt;li&gt;Using a library like &lt;a href="https://typegraphql.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Typegraphql&lt;/strong&gt;&lt;/a&gt; to write your schema as code&lt;/li&gt;
&lt;li&gt;Define them directly as JS/TS objects as defined &lt;a href="https://graphql.org/graphql-js/type/" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and there are more, with new ones evolving over time.&lt;/p&gt;
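&lt;p&gt;The first option, for instance, is just a plain &lt;code&gt;.graphql&lt;/code&gt; file that you load and parse at startup (a minimal, hypothetical schema):&lt;/p&gt;

```graphql
# schema.graphql, loaded and parsed by the server at startup
type Query {
  book(id: ID!): Book
}

type Book {
  id: ID!
  title: String!
}
```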

&lt;p&gt;One thing to understand is that, if you are using Node.js, &lt;a href="https://github.com/graphql/graphql-js" rel="noopener noreferrer"&gt;graphql-js&lt;/a&gt; would typically be the underlying implementation of all these libraries, and ultimately everything gets converted to JS/TS objects (typically an &lt;a href="https://graphql.org/graphql-js/language/" rel="noopener noreferrer"&gt;AST&lt;/a&gt;), making all of these abstractions on top of the existing way to define schemas. Note that the implementation can differ a bit in other languages, or even within Node.js if you use an alternative implementation like &lt;a href="https://github.com/zalando-incubator/graphql-jit" rel="noopener noreferrer"&gt;graphql-jit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are some of the GraphQL servers available and how do they differ?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are using Node.js there are a lot of implementations of GraphQL servers with a few being &lt;a href="https://github.com/graphql/express-graphql" rel="noopener noreferrer"&gt;express-graphql&lt;/a&gt;, &lt;a href="https://github.com/apollographql/apollo-server" rel="noopener noreferrer"&gt;apollo-server&lt;/a&gt;, &lt;a href="https://github.com/mercurius-js/mercurius" rel="noopener noreferrer"&gt;mercurius&lt;/a&gt;, &lt;a href="https://github.com/contrawork/graphql-helix" rel="noopener noreferrer"&gt;graphql-helix&lt;/a&gt; and more. And if you are using other languages, you can see a great list &lt;a href="https://graphql.org/code/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, in the context of Node.js, the right choice varies depending on your use case.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are you dependent on &lt;a href="https://www.apollographql.com/" rel="noopener noreferrer"&gt;Apollo&lt;/a&gt; or its ecosystem like federation? Go for apollo-server&lt;/li&gt;
&lt;li&gt;Do you use &lt;a href="https://github.com/expressjs/express" rel="noopener noreferrer"&gt;express&lt;/a&gt; as your framework? Use express-graphql&lt;/li&gt;
&lt;li&gt;Are you using fastify or looking for a performant GraphQL library with comprehensive support? Go for mercurius&lt;/li&gt;
&lt;li&gt;Are you looking for making things as modular as possible, reduce bloat, and progressively extend functionality as you go along? Go for graphql-helix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, there are a lot of things I have not mentioned, but this is a starting point which suggests some of the factors to take into account.&lt;/p&gt;

&lt;p&gt;And in fact, if you are keen on understanding how each GraphQL server performs, I would recommend checking out &lt;a href="https://github.com/benawad/node-graphql-benchmarks" rel="noopener noreferrer"&gt;this&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the best way to leverage GraphQL with Typescript?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Considering both GraphQL and Typescript are strongly typed, we can combine them to give us an amazing experience with the help of some tooling, making the end-to-end request-response lifecycle strongly typed.&lt;/p&gt;

&lt;p&gt;For instance, there are some amazing projects from &lt;a href="https://the-guild.dev/" rel="noopener noreferrer"&gt;&lt;strong&gt;The Guild&lt;/strong&gt;&lt;/a&gt; like &lt;a href="https://graphql-code-generator.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL Codegen&lt;/strong&gt;&lt;/a&gt;, which you can use to generate types based on your local/remote schema with great Typescript integration, and there are a lot of plugins/recipes you can use along with it as well.&lt;/p&gt;

&lt;p&gt;Want to generate Typescript objects based on GQL documents? You can try out &lt;a href="https://github.com/dotansimha/graphql-typed-document-node" rel="noopener noreferrer"&gt;&lt;strong&gt;Typed Document Node&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or do you want to directly code the schema in Typescript and maintain strict types? Try &lt;a href="https://typegraphql.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Typegraphql&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well, there are more examples like these and this is just a start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I setup my Dev environment to work on GraphQL?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While this needs a separate blog post all by itself, here are some examples.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you are using &lt;a href="https://code.visualstudio.com/" rel="noopener noreferrer"&gt;VSCode&lt;/a&gt; and are looking to enable syntax highlighting, validation, autocomplete, code-completion and so on, you can try using either &lt;a href="https://marketplace.visualstudio.com/items?itemName=GraphQL.vscode-graphql" rel="noopener noreferrer"&gt;VSCode GraphQL&lt;/a&gt; or &lt;a href="https://marketplace.visualstudio.com/items?itemName=apollographql.vscode-apollo" rel="noopener noreferrer"&gt;Apollo GraphQL&lt;/a&gt; depending on which suits you better.&lt;/li&gt;
&lt;li&gt;If you are working with &lt;a href="https://www.typescriptlang.org/" rel="noopener noreferrer"&gt;Typescript&lt;/a&gt; it would be better to have codegen setup as part of your workflow.&lt;/li&gt;
&lt;li&gt;If you want to validate your schema as and when you push to version control, set up something like &lt;a href="https://github.com/kamilkisiela/graphql-inspector" rel="noopener noreferrer"&gt;GraphQL Inspector&lt;/a&gt; locally and in your CI/CD pipelines to maintain your sanity. If you use the Apollo ecosystem, this comes built into Apollo Studio and the CLI tools it gives you.&lt;/li&gt;
&lt;li&gt;Want ESLint support to enforce standards across your team? Try something like &lt;a href="https://github.com/dotansimha/graphql-eslint" rel="noopener noreferrer"&gt;GraphQL ESLint&lt;/a&gt; and set it up with your preferred conventions.&lt;/li&gt;
&lt;li&gt;Set up a &lt;a href="https://graphql-config.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;graphql-config&lt;/strong&gt;&lt;/a&gt;, and it will interface with other tooling like the codegen, the VSCode GraphQL extension, GraphQL ESLint and more. This helps a lot since you have one config to manage all the interfacing tools. If you are using the Apollo stack, you might need an &lt;a href="https://www.apollographql.com/docs/devtools/editor-plugins/#setting-up-an-apollo-config" rel="noopener noreferrer"&gt;&lt;strong&gt;apollo-config&lt;/strong&gt;&lt;/a&gt; as well&lt;/li&gt;
&lt;li&gt;If you want to keep your GraphQL code as modular as possible with support for things like dependency injection, try something like &lt;a href="https://graphql-modules.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL Modules&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Want to interface with multiple different data sources and integrations each with their own format but still have the experience of GraphQL when developing on top of them? Try something like &lt;a href="https://graphql-mesh.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL Mesh&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Want to use a tool to test GraphQL endpoints? You might need something like &lt;a href="https://insomnia.rest/" rel="noopener noreferrer"&gt;&lt;strong&gt;Insomnia&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;,&lt;/strong&gt; &lt;a href="https://www.postman.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Postman&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;,&lt;/strong&gt; &lt;a href="https://hoppscotch.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;Hoppscotch&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;or&lt;/strong&gt; &lt;a href="https://marketplace.visualstudio.com/items?itemName=humao.rest-client" rel="noopener noreferrer"&gt;&lt;strong&gt;VSCode REST Client&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And while I could talk more about this, it would never end, because the ecosystem is huge and thriving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I use React/Angular/Vue/Web Components. How do I integrate GraphQL with my components?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Again, the front end ecosystem is huge as well with its own set of tooling and libraries.&lt;/p&gt;

&lt;p&gt;In my case, I typically work on the frontend without any framework (I use &lt;a href="https://lit-element.polymer-project.org/" rel="noopener noreferrer"&gt;Lit Elements&lt;/a&gt;, and we will have a separate blog on that soon). The tool you use here depends entirely on your requirements.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apollo Client has good integrations with frameworks and platforms including &lt;a href="https://www.apollographql.com/docs/react/" rel="noopener noreferrer"&gt;React&lt;/a&gt;, &lt;a href="https://www.apollographql.com/docs/ios/" rel="noopener noreferrer"&gt;iOS&lt;/a&gt; and &lt;a href="https://www.apollographql.com/docs/android/" rel="noopener noreferrer"&gt;&lt;strong&gt;Android&lt;/strong&gt;&lt;/a&gt; — so, you might want to check that out&lt;/li&gt;
&lt;li&gt;Using React? &lt;a href="https://relay.dev/" rel="noopener noreferrer"&gt;Relay&lt;/a&gt; can be a &lt;a href="https://blog.bitsrc.io/apollo-and-relay-side-by-side-adb5e3844935" rel="noopener noreferrer"&gt;great choice&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Using Vue? You can try &lt;a href="https://apollo.vuejs.org/guide/" rel="noopener noreferrer"&gt;&lt;strong&gt;Vue Apollo&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Using web components with Apollo Stack for GQL? You might want to check out &lt;a href="https://github.com/apollo-elements/apollo-elements" rel="noopener noreferrer"&gt;&lt;strong&gt;Apollo Elements&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Using vanilla JS or TS or using web components and want to have a framework-independent way of doing things? You can stick to the GraphQL codegen itself since it takes care of almost everything underneath. Or if you want, you can also use Apollo Client’s vanilla version &lt;code&gt;@apollo/client/core&lt;/code&gt;.
&lt;a href="https://github.com/apollo-elements/apollo-elements" rel="noopener noreferrer"&gt;&lt;strong&gt;Apollo Elements&lt;/strong&gt;&lt;/a&gt; does come with support for a lot of webcomponent libraries like &lt;a href="https://lit-element.polymer-project.org/" rel="noopener noreferrer"&gt;Lit&lt;/a&gt;, &lt;a href="https://www.fast.design/" rel="noopener noreferrer"&gt;Fast&lt;/a&gt; and &lt;a href="https://github.com/ruphin/gluonjs" rel="noopener noreferrer"&gt;Gluon&lt;/a&gt; or even without any of it and hence is quite flexible.&lt;/li&gt;
&lt;li&gt;Or if you are just looking for a lightweight, performant and extensible GraphQL client, &lt;a href="https://formidable.com/open-source/urql/" rel="noopener noreferrer"&gt;URQL&lt;/a&gt; can be great as well.&lt;/li&gt;
&lt;li&gt;Or if you are looking for a minimal client which runs both in the Browser and Node, you can try &lt;a href="https://github.com/prisma-labs/graphql-request" rel="noopener noreferrer"&gt;GraphQL Request&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
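&lt;p&gt;And if you do go the vanilla route, remember the transport is just an HTTP POST with a JSON body. Here is a minimal, framework-free sketch (the endpoint URL is hypothetical):&lt;/p&gt;

```javascript
// Build a standard GraphQL-over-HTTP POST request; any client (fetch, axios,
// a GraphQL library) ultimately sends a body shaped like this.
function buildGraphQLRequest(query, variables = {}) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query, variables }),
  };
}

// Usage (Node 18+ or browsers):
// fetch('https://api.example.com/graphql', buildGraphQLRequest(
//   'query ($id: ID!) { user(id: $id) { name } }',
//   { id: '42' }
// )).then((res) => res.json());
```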

&lt;p&gt;Well, there are a lot of other ways we haven’t talked about, and this is just a start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are some of the ways in which I can maintain performance while using GraphQL?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While GraphQL is really promising and helpful, you have to understand that, like any technology or framework, it comes with its own set of problems, most of which have already been addressed. For instance, you might have heard about the N+1 problem, the lack of caching, query cost and complexity, and so on; these have been addressed by projects like &lt;a href="https://github.com/graphql/dataloader" rel="noopener noreferrer"&gt;Dataloader&lt;/a&gt;, &lt;a href="https://www.apollographql.com/docs/apollo-server/performance/apq/" rel="noopener noreferrer"&gt;Persisted Queries&lt;/a&gt;, &lt;a href="https://www.apollographql.com/docs/apollo-server/performance/caching/" rel="noopener noreferrer"&gt;Caching&lt;/a&gt; and more, which you can set up depending on your needs.&lt;/p&gt;
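&lt;p&gt;To make the N+1 problem and its fix concrete, here is a stripped-down sketch of the batching idea behind Dataloader (&lt;code&gt;TinyLoader&lt;/code&gt; is a toy name; the real library also adds per-request caching and error handling):&lt;/p&gt;

```javascript
// TinyLoader is a toy sketch of the batching idea behind Dataloader:
// loads requested in the same tick are collected and resolved with one
// batch call instead of N separate ones (the N+1 problem).
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn; // async (keys) => values, same order as keys
    this.queue = [];
  }

  load(key) {
    return new Promise((resolve) => {
      this.queue.push({ key, resolve });
      // Schedule a single flush for everything queued in this tick.
      if (this.queue.length === 1) {
        process.nextTick(() => this.flush());
      }
    });
  }

  async flush() {
    const batch = this.queue;
    this.queue = [];
    const values = await this.batchFn(batch.map((item) => item.key));
    batch.forEach((item, i) => item.resolve(values[i]));
  }
}

// Usage: three resolver-style loads collapse into one "database" call.
let calls = 0;
const loader = new TinyLoader(async (ids) => {
  calls += 1;
  return ids.map((id) => ({ id, name: `user-${id}` }));
});

Promise.all([loader.load(1), loader.load(2), loader.load(3)]).then(() => {
  console.log(`batch calls: ${calls}`);
});
```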

&lt;p&gt;Ultimately it depends on the degree of flexibility you want to offer. The more the flexibility, the higher the cost. And it is for you to decide based on your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are some of the principles or standards to be followed when trying to build my datagraph architecture?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some amazing people have already answered this &lt;a href="https://principledgraphql.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt; and I highly recommend going through it before starting off your journey with GraphQL.&lt;/p&gt;

&lt;p&gt;And if you are looking for some help with the rules and implementation details with respect to GraphQL, you can find a great doc on this &lt;a href="https://graphql-rules.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While all of these are principles trying to guide you in the right direction, choose what is best for your usecase and work with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I use GraphQL to interact with multiple sources of data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the great examples of real-world implementation of this would be &lt;a href="https://www.gatsbyjs.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Gatsby&lt;/strong&gt;&lt;/a&gt; where irrespective of the source of data, everything ultimately gets converted to GraphQL with plugins which can then be used in your workflow.&lt;/p&gt;

&lt;p&gt;If you are to build it in the server side, either you can use an out of the box solution like &lt;a href="https://graphql-mesh.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL Mesh&lt;/strong&gt;&lt;/a&gt; or you can build it on your own since GraphQL just acts as an abstraction on top.&lt;/p&gt;

&lt;p&gt;Or if you are on the apollo stack and want to connect to multiple data sources, you can have a look at &lt;a href="https://www.apollographql.com/docs/tutorial/data-source/" rel="noopener noreferrer"&gt;&lt;strong&gt;apollo-datasource&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or you may want a single ORM which closely resembles GraphQL, like &lt;a href="https://www.prisma.io/" rel="noopener noreferrer"&gt;Prisma&lt;/a&gt;, to integrate with multiple databases underneath.&lt;/p&gt;

&lt;p&gt;Ultimately it all boils down to how you structure your resolvers.&lt;/p&gt;
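&lt;p&gt;As a small sketch of that idea, here is a resolver map where each field is backed by a different stubbed, hypothetical data source: a REST API for users and a database for their orders:&lt;/p&gt;

```javascript
// Stub data sources standing in for a REST API and a database.
const restApi = {
  fetchUser: async (id) => ({ id, name: 'Ada' }),
};
const database = {
  ordersFor: async (userId) => [{ id: 'order-1', userId }],
};

// The resolver map is the abstraction point: each field decides where
// its data comes from, and the client only ever sees one graph.
const resolvers = {
  Query: {
    user: (_parent, { id }) => restApi.fetchUser(id),
  },
  User: {
    orders: (user) => database.ordersFor(user.id),
  },
};
```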

&lt;p&gt;But it does not stop here. Some databases also support GraphQL, either via adapters or natively.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dgraph.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;Dgraph&lt;/strong&gt;&lt;/a&gt; has a native &lt;a href="https://dgraph.io/graphql" rel="noopener noreferrer"&gt;GraphQL&lt;/a&gt; implementation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://neo4j.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Neo4j&lt;/strong&gt;&lt;/a&gt; has a &lt;a href="https://neo4j.com/developer/graphql/" rel="noopener noreferrer"&gt;GraphQL&lt;/a&gt; adapter&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hasura.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;Hasura&lt;/strong&gt;&lt;/a&gt; provides a GraphQL abstraction on top of your datasources&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.graphile.org/postgraphile/" rel="noopener noreferrer"&gt;&lt;strong&gt;Postgraphile&lt;/strong&gt;&lt;/a&gt; can help if you use Postgres&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, these are just some of the tools and services. There are more like this which can help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The GraphQL spec is missing some of the types like DateTime, GeoLocation and more. How do I implement that?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, this can be painful. But, it is by design to keep GraphQL as lean and lightweight as possible.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://graphql.org/learn/schema/#scalar-types" rel="noopener noreferrer"&gt;GraphQL Scalars&lt;/a&gt; really help. You can define your own types and use them across your schema if they are not supported out of the box.&lt;/p&gt;

&lt;p&gt;But, this can be tedious to implement and using a package like &lt;a href="https://github.com/Urigo/graphql-scalars" rel="noopener noreferrer"&gt;&lt;strong&gt;graphql-scalars&lt;/strong&gt;&lt;/a&gt; can actually help since it comes inbuilt with some of the commonly used scalars which you can import and use.&lt;/p&gt;
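&lt;p&gt;Declaring a custom scalar in SDL is a single line; the serialization and parsing logic then lives in your server code (the types shown are illustrative):&lt;/p&gt;

```graphql
scalar DateTime

type Event {
  id: ID!
  # Resolvers serialize/parse this via the scalar's server-side logic.
  startsAt: DateTime!
}
```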

&lt;p&gt;&lt;strong&gt;There are some fields which I find myself repeating between various queries and mutations. How do I avoid doing this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As the &lt;a href="https://en.wikipedia.org/wiki/Don't_repeat_yourself" rel="noopener noreferrer"&gt;DRY principle&lt;/a&gt; goes, we can also make our operations modular with the help of &lt;a href="https://graphql.org/learn/queries/#fragments" rel="noopener noreferrer"&gt;GraphQL Fragments&lt;/a&gt; and then use those fragments as applicable anywhere.&lt;/p&gt;
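&lt;p&gt;For example, a hypothetical &lt;code&gt;userFields&lt;/code&gt; fragment lets two operations share one selection set:&lt;/p&gt;

```graphql
fragment userFields on User {
  id
  name
  avatarUrl
}

query Me {
  me {
    ...userFields
  }
}

query Team {
  team {
    members {
      ...userFields
    }
  }
}
```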

&lt;p&gt;&lt;strong&gt;Can’t I convert my Database schema directly to a GraphQL schema or generate a GraphQL schema?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While it is technically possible (and this is what database providers who offer a GraphQL layer on top, like Hasura or Graphcool, do), it is highly &lt;strong&gt;not recommended&lt;/strong&gt; for direct client consumption, and I would also recommend reading &lt;a href="https://graphql.org/learn/thinking-in-graphs/" rel="noopener noreferrer"&gt;this&lt;/a&gt; to get a better idea.&lt;/p&gt;

&lt;p&gt;The main reason for this, in my view, is that GraphQL is meant to describe a data graph which revolves around business/domain terminology without exposing the underlying technical complexity or details. For instance, one should not care about which table a specific field comes from, how to join, and so on.&lt;/p&gt;

&lt;p&gt;It should just be about the business implementation for the end users so even a product manager who does not know about the underlying technical implementation can use it.&lt;/p&gt;

&lt;p&gt;So, while you may use GraphQL as sort of an ORM for your databases or other data sources, exposing that directly to the clients is not a good option. Rather, there should be one more layer on top just to have it make sense for any end user and reduce the complexity for clients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are there some helpers libraries I can use to work with my GraphQL schemas?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. &lt;a href="https://www.graphql-tools.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL Tools&lt;/strong&gt;&lt;/a&gt; (which was initially from Apollo and then taken over by the Guild) is one of those libraries which I highly recommend. You can do a lot of operations on your SDL or schema like merging multiple schemas, mocking your schemas with test data, building custom directives, loading remote schemas and so on which you can add as part of your stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the best strategy to distribute your schema? What if I am using Microservices with GraphQL?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While GraphQL is meant to provide a single endpoint and a single unified view of the data for clients, it is often not possible to do it all in one place, since that can create a lot of bottlenecks. This is why &lt;a href="https://www.graphql-tools.com/docs/stitch-combining-schemas/" rel="noopener noreferrer"&gt;&lt;strong&gt;Schema stitching&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://www.apollographql.com/docs/federation/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apollo Federation&lt;/strong&gt;&lt;/a&gt; came about, letting multiple subschemas contribute to the unified data graph.&lt;/p&gt;
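&lt;p&gt;In Apollo Federation, for instance, subgraphs mark shared entities with &lt;code&gt;@key&lt;/code&gt; so the gateway knows how to join them (a minimal two-subgraph sketch; the type names are illustrative):&lt;/p&gt;

```graphql
# users subgraph
type User @key(fields: "id") {
  id: ID!
  name: String!
}

# reviews subgraph: extends User with a field it owns
extend type User @key(fields: "id") {
  id: ID! @external
  reviews: [Review!]!
}

type Review {
  id: ID!
  body: String!
}
```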

&lt;p&gt;While we can have a separate blog on Schema Stitching versus Federation sometime down the line, each has its own set of merits and demerits, which you can understand only by giving both a try.&lt;/p&gt;

&lt;p&gt;These videos can help get some basics (but a lot has changed since these videos were released especially with GraphQL Tools introducing &lt;a href="https://www.graphql-tools.com/docs/stitch-type-merging/" rel="noopener noreferrer"&gt;&lt;strong&gt;Type Merging&lt;/strong&gt;&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Vq0ajno-zgw"&gt;
&lt;/iframe&gt;
&lt;/p&gt;



&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/LKQKn1oFXJU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;If you are still confused on what to go for, you can also read &lt;a href="https://product.voxmedia.com/2020/11/2/21494865/to-federate-or-stitch-a-graphql-gateway-revisited" rel="noopener noreferrer"&gt;&lt;strong&gt;this&lt;/strong&gt;&lt;/a&gt; blog about stitching and federation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are some of the GraphQL events/conferences to watch out for?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since GraphQL was released, it has garnered so much interest in the community that a lot of conferences, events and meetups are held around the world with GraphQL as the main theme. Some of them are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://summit.graphql.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;The GraphQL Summit&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://graphqlconf.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL Conf&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hasura.io/enterprisegraphql/" rel="noopener noreferrer"&gt;&lt;strong&gt;Enterprise GraphQL&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://graphql.asia/" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL Asia&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://graphqlgalaxy.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL Galaxy&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and there are more including meetups like &lt;a href="https://www.meetup.com/topics/graphql/" rel="noopener noreferrer"&gt;&lt;strong&gt;these&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://graphql.org/community/upcoming-events/" rel="noopener noreferrer"&gt;&lt;strong&gt;these&lt;/strong&gt;&lt;/a&gt;. You can find most of the previous sessions recorded on Youtube if you search for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can I contribute to GraphQL and its ecosystem?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every bit of help really counts, since the GraphQL Foundation is run by a set of volunteers and it is all open source. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write blogs like this to spread knowledge amongst the community&lt;/li&gt;
&lt;li&gt;Host meetups, speak in conferences about your experience and evangelize your best way possible.&lt;/li&gt;
&lt;li&gt;Contribute to the &lt;a href="https://github.com/graphql/graphql-spec" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL spec&lt;/strong&gt;&lt;/a&gt; with your suggestions (some suggestions may take years to implement even if they are good, so you may need a lot of patience for this)&lt;/li&gt;
&lt;li&gt;Contribute to the ecosystem of tools leveraging GraphQL be it with documentation, tests, features, bug fixes, feedback and what not. It will definitely help.&lt;/li&gt;
&lt;li&gt;Facing a challenge with GraphQL which has not been solved before? Build your own tooling and contribute it to the community&lt;/li&gt;
&lt;li&gt;Create failing tests and reproducible projects&lt;/li&gt;
&lt;li&gt;Answer and help others on Github Issues, Discord, Stack Overflow, Twitter, Reddit. There are a lot of amazing GraphQL communities out there.&lt;/li&gt;
&lt;li&gt;Or if you want to take it to the next level and want to align your entire organization to help the GraphQL foundation, &lt;a href="https://foundation.graphql.org/join/" rel="noopener noreferrer"&gt;become its member&lt;/a&gt; and contribute.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a lot of ways in which you can give back. Small or big does not matter; every contribution counts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are there some case studies which can actually help me in the implementation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sure. While I can’t list them all here, here are some:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-1-ae3557c187e2" rel="noopener noreferrer"&gt;Netflix and GraphQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/airbnb-engineering/how-airbnb-is-moving-10x-faster-at-scale-with-graphql-and-apollo-aa4ec92d69e2" rel="noopener noreferrer"&gt;Airbnb and GraphQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.graphql.com/articles/graphql-at-github" rel="noopener noreferrer"&gt;Github and GraphQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://about.sourcegraph.com/graphql/graphql-at-twitter/" rel="noopener noreferrer"&gt;Twitter and GraphQL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and you can find more &lt;a href="https://www.graphql.com/case-studies/" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are there any publicly available GraphQL APIs which I can play around with?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. While most of them would require you to authenticate, they are available for you to use. Some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developer.github.com/v4/explorer/" rel="noopener noreferrer"&gt;Github GraphQL Explorer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.com/-/graphql-explorer" rel="noopener noreferrer"&gt;Gitlab GraphQL Explorer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.yelp.com/developers/graphiql" rel="noopener noreferrer"&gt;Yelp GraphQL Explorer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can have a look at more like these &lt;a href="https://github.com/APIs-guru/graphql-apis" rel="noopener noreferrer"&gt;here&lt;/a&gt; and play around with them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I have a legacy architecture/stack as part of my organization. How do I incrementally migrate to GraphQL?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is one of the places where GraphQL really shines. You need not move everything over in one go. Here are some steps which might help.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, build a Datagraph for your entire business without worrying about the underlying logic/implementation. But don’t worry too much since you can always evolve this over time.&lt;/li&gt;
&lt;li&gt;Next, implement resolvers for every part of the schema in such a way that, in phase 1, you just wrap your existing infrastructure with GraphQL. For instance, if your services use SOAP, you can add a GraphQL layer on top of it, and calling that will call the SOAP service underneath, so the client need not worry about it. You can use something like &lt;a href="https://graphql-mesh.com/" rel="noopener noreferrer"&gt;GraphQL Mesh&lt;/a&gt; or &lt;a href="https://sofa-api.com/" rel="noopener noreferrer"&gt;SOFA&lt;/a&gt; which can help in abstracting these. There is a good blog post on how to migrate from REST to GraphQL &lt;a href="https://the-guild.dev/blog/migrating-from-rest" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Change the client implementation one by one to call the GraphQL gateway instead of the legacy service.&lt;/li&gt;
&lt;li&gt;Now that you have GraphQL working in your ecosystem, you can incrementally move away from legacy implementations like SOAP to a native GraphQL implementation, one component at a time, without worrying about how it will affect the clients.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While this is one possible approach, it is not the only one. There are a lot of other ways in which you can take this one step at a time without worrying about the legacy code you have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I secure my GraphQL endpoint?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the GraphQL spec itself does not recommend any specific way to do this and leaves it to the person implementing it, you can use &lt;a href="https://jwt.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;JWT&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;,&lt;/strong&gt; cookies, sessions and so on, like you normally would when authenticating through other mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I enable authorization to my GraphQL fields or schema?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where GraphQL is very powerful, since you can do authorization at a very fine-grained level, be it at the type level or the field level. You can read &lt;a href="https://www.apollographql.com/blog/authorization-in-graphql-452b1c402a9/" rel="noopener noreferrer"&gt;this&lt;/a&gt; blog, which suggests various ways in which you can do authorization.&lt;/p&gt;

&lt;p&gt;You can also use libraries like &lt;a href="https://github.com/maticzav/graphql-shield" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL Shield&lt;/strong&gt;&lt;/a&gt;, which offers powerful middlewares to do this. But remember that authorization does come with a cost, since you are running specific logic in or before your resolvers for every field you want to authorize.&lt;/p&gt;

&lt;p&gt;One often overlooked option is the use of &lt;a href="https://www.graphql-tools.com/docs/schema-directives/" rel="noopener noreferrer"&gt;&lt;strong&gt;directives&lt;/strong&gt;&lt;/a&gt; to do authorization (one example is covered in &lt;a href="https://blog.grandstack.io/authorization-in-graphql-using-custom-schema-directives-eafa6f5b4658" rel="noopener noreferrer"&gt;this&lt;/a&gt; blog), which is very powerful and declarative. This way, you can specify the scope and add the directive to the respective fields in your SDL, and it does the job for you.&lt;/p&gt;
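&lt;p&gt;To make the idea concrete, here is a hedged sketch of field-level authorization as a higher-order resolver, similar in spirit to what GraphQL Shield middlewares or an auth directive do for you. The scope names and context shape are made up for the example:&lt;/p&gt;

```javascript
// Sketch: wrap any resolver with a scope check before it runs.
// The scope names ("hr:read") and context.user shape are hypothetical.
function requireScope(scope, resolver) {
  return function (parent, args, context, info) {
    const scopes = (context.user ? context.user.scopes : null) || [];
    if (scopes.includes(scope) === false) {
      throw new Error(`Not authorized: missing scope "${scope}"`);
    }
    return resolver(parent, args, context, info);
  };
}

const resolvers = {
  Query: {
    // Only callers holding the "hr:read" scope may resolve this field.
    salaries: requireScope('hr:read', () => [100000, 120000]),
  },
};
```

&lt;p&gt;A schema directive achieves the same effect declaratively, rewriting resolvers like this under the hood.&lt;/p&gt;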

&lt;p&gt;&lt;strong&gt;How do I enable real-time applications like Chat, auto-updates and so on in my application with GraphQL?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are some options currently to do this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first would be to use &lt;a href="https://www.howtographql.com/graphql-js/7-subscriptions/" rel="noopener noreferrer"&gt;GraphQL Subscriptions&lt;/a&gt;, which are part of the spec. You have to register the subscriptions upfront and also have support for WebSockets if you want to go this way.&lt;/li&gt;
&lt;li&gt;Another way is to do periodic polling, which can work at a small scale while keeping your application stateless.&lt;/li&gt;
&lt;li&gt;Another option is to use &lt;a href="https://medium.com/open-graphql/graphql-subscriptions-vs-live-queries-e38302c7ab8e" rel="noopener noreferrer"&gt;live queries&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each option comes with its own set of advantages and disadvantages. Just remember that it is often not possible to keep your application stateless if you want something like subscriptions, so make sure you manage the state well and plan for failures and for scaling your app.&lt;/p&gt;

&lt;p&gt;And if you are new to subscriptions, you can &lt;a href="https://www.youtube.com/watch?v=Wi7P39sF2nw&amp;amp;feature=youtu.be" rel="noopener noreferrer"&gt;watch this&lt;/a&gt; to get an idea of the basics of how subscriptions work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What can I even do with introspection?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://graphql.org/learn/introspection/" rel="noopener noreferrer"&gt;&lt;strong&gt;Introspection&lt;/strong&gt;&lt;/a&gt; is typically used by the tooling to understand your GraphQL types and schema. For instance, tools like &lt;a href="https://apis.guru/graphql-voyager/" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQL Voyager&lt;/strong&gt;&lt;/a&gt; can introspect your schema and build amazing graphs, and almost all extensions built around GraphQL leverage this power to understand your schema, types and everything around it.&lt;/p&gt;

&lt;p&gt;Note that experts recommend disabling introspection in production for security and performance reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I do tracing of all operations in GraphQL?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are various ways in which you can do this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you want to do this on your own, you can send traces or context from within the resolvers using the &lt;a href="https://www.jaegertracing.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;Jaeger&lt;/strong&gt;&lt;/a&gt;/&lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt;&lt;/a&gt; SDKs, sending all the information for tracing manually.&lt;/li&gt;
&lt;li&gt;OpenTelemetry has recently added support for GraphQL. You can find it &lt;a href="https://github.com/open-telemetry/opentelemetry-js-contrib/tree/master/plugins/node/opentelemetry-instrumentation-graphql" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;But if you find yourself on the Apollo stack, Apollo comes with its own tracing options like Apollo Tracing, and you can read about it &lt;a href="https://www.apollographql.com/blog/exposing-trace-data-for-your-graphql-server-with-apollo-tracing-97c5dd391385/" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
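&lt;p&gt;If you go the manual route, the core idea can be sketched as a wrapper that times each resolver and hands the measurement to a collector. Here &lt;code&gt;traces&lt;/code&gt; is just an in-memory stand-in for a real tracing backend; a real setup would create spans via the OpenTelemetry SDK instead:&lt;/p&gt;

```javascript
// Hand-rolled sketch of resolver tracing: wrap a resolver, record its
// duration, and hand the measurement to a collector. "traces" is an
// in-memory stand-in for a real tracing backend like Jaeger.
const traces = [];

function traced(name, resolver) {
  return async function (parent, args, context, info) {
    const start = process.hrtime.bigint();
    try {
      return await resolver(parent, args, context, info);
    } finally {
      // Record the elapsed time even if the resolver throws.
      const durationNs = Number(process.hrtime.bigint() - start);
      traces.push({ resolver: name, durationNs });
    }
  };
}
```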

&lt;p&gt;Just remember that tracing adds a lot of performance overhead, so it is highly recommended to keep it off unless needed, or to enable it only for specific layers of concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I handle errors gracefully?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Again, there are a lot of ways to do this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you use the Apollo stack, you can use its error-handling helpers as documented &lt;a href="https://www.apollographql.com/docs/apollo-server/data/errors/" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;If you use express-graphql or graphql-js natively, they expose error classes based on &lt;a href="https://github.com/graphql/express-graphql#additional-validation-rules" rel="noopener noreferrer"&gt;&lt;strong&gt;GraphQLError&lt;/strong&gt;&lt;/a&gt;, and you can use GraphQL extensions to augment errors with a custom payload such as error codes, which is what you typically do when using servers like graphql-helix.&lt;/li&gt;
&lt;/ul&gt;
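&lt;p&gt;Whichever library you pick, the underlying response pattern is the same: application errors are reported in the &lt;code&gt;errors&lt;/code&gt; array of the response, with machine-readable codes under &lt;code&gt;extensions&lt;/code&gt;. A minimal sketch of that shape (the code names here are conventions, not mandated by the spec):&lt;/p&gt;

```javascript
// Sketch of the common GraphQL error shape: errors live in the "errors"
// array, with machine-readable codes under "extensions".
function toGraphQLError(message, code) {
  return {
    message,
    extensions: { code }, // e.g. UNAUTHENTICATED, BAD_USER_INPUT
  };
}

function buildResponse(data, errors) {
  const response = { data };
  // Per the spec, "errors" is only present when something went wrong.
  if (errors.length > 0) response.errors = errors;
  return response;
}
```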

&lt;p&gt;Now, this is the case because GraphQL has no dependence on the transport layer: status codes like 200, 400 or 500 may not make sense unless they are part of the response, and the spec does not prescribe a specific way to handle this either.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is GraphQL related to Graph databases in some way?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While GraphQL encourages you to think of your entire data as a graph of connected information, since that gives better insight into how to structure your schema and leads to a unified data graph, it has no relation to graph databases by itself. Graph databases are a way to represent and store data in underlying storage systems to allow for fast traversal, walking and retrieval.&lt;/p&gt;

&lt;p&gt;That being said, GraphQL and graph databases do have a lot of synergy, since both are all about establishing the data schema and its relationships. You can read about that &lt;a href="https://dgraph.io/blog/post/graphdb-for-your-next-app/" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://datalanguage.com/blog/graphql-and-graph-databases" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When exposing REST APIs to end users, I used to bill users based on API calls made. How do I do this for GraphQL?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This can be a challenging problem, because in GraphQL it is the clients that decide what to query or mutate, and the server may not know this upfront unless you are using something like persisted queries.&lt;/p&gt;

&lt;p&gt;And here, the CPU consumed can depend on the level of nesting of the queries, the operations your resolvers perform, and so on, making it difficult to estimate costs upfront. You can find a detailed blog about this &lt;a href="https://medium.com/dev-genius/a-principled-approach-to-graphql-query-cost-analysis-8c7243de42c1" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One way to handle this is to allow only persisted queries, approving them and assigning costs upfront, but this can get tricky to manage in the long run as the number of queries and mutations increases.&lt;/li&gt;
&lt;li&gt;Another way is to use custom cost directives, as in &lt;a href="https://github.com/pa-bru/graphql-cost-analysis" rel="noopener noreferrer"&gt;this package&lt;/a&gt;, manually specifying the complexity and cost and using that to bill your APIs.&lt;/li&gt;
&lt;/ul&gt;
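&lt;p&gt;As a toy illustration of the cost-analysis idea, here is a very naive estimator that charges by nesting depth, scanning braces in the query text. Real tools like graphql-cost-analysis walk the parsed AST and honor per-field cost directives; the pricing rule below is entirely made up:&lt;/p&gt;

```javascript
// Toy sketch of query cost analysis: estimate cost from nesting depth by
// scanning braces in the query text. Real implementations walk the parsed
// AST and apply per-field cost directives; this only illustrates the idea.
function estimateCost(query) {
  let depth = 0;
  let maxDepth = 0;
  for (const ch of query) {
    if (ch === '{') {
      depth += 1;
      if (depth > maxDepth) maxDepth = depth;
    } else if (ch === '}') {
      depth -= 1;
    }
  }
  // Hypothetical pricing rule: deeper nesting costs more.
  return maxDepth * 10;
}
```

&lt;p&gt;A gateway could reject or surcharge queries whose estimated cost exceeds the caller's plan before executing any resolvers.&lt;/p&gt;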

&lt;p&gt;This is a relatively new area and still under exploration. For instance, Dgraph bills for Slash GraphQL based on the nodes accessed, as mentioned &lt;a href="https://dgraph.io/pricing" rel="noopener noreferrer"&gt;here&lt;/a&gt;, which can be valid for databases exposing GraphQL but not necessarily for a GraphQL API by itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here are some other resources that host FAQs on GraphQL as well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.apollographql.com/docs/resources/faq/" rel="noopener noreferrer"&gt;Apollo GraphQL FAQ&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://graphql.org/faq/" rel="noopener noreferrer"&gt;GraphQL Org FAQ&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.howtographql.com/advanced/5-common-questions/" rel="noopener noreferrer"&gt;Howtographql FAQ&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there are more. Just google for them.&lt;/p&gt;

&lt;p&gt;Hope this was informative. Do you have any questions which I have not covered in this list, or are you looking for some help? Let me know by reaching out to me &lt;a href="https://twitter.com/techahoy" rel="noopener noreferrer"&gt;@techahoy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And if this helped, do share it with your friends, hang around, and follow us for more like this every week. See you all soon.&lt;/p&gt;

</description>
      <category>graphql</category>
      <category>architecture</category>
      <category>typescript</category>
      <category>node</category>
    </item>
  </channel>
</rss>
