Why have so many software development organizations adopted agile in recent years? Most likely because it lets R&D keep pace with ever-changing business goals. Agile lets you deliver new functionality quickly, in small iterations that are easier to review, test, and validate automatically. That reduces the overall risk of each release and takes manual gating, and the human error that comes with it, out of the equation.
In short, it lets us iterate fast without breaking things. Building a continuous delivery pipeline with automated gates, checks, and rollback builds confidence, lets us innovate faster, and tightens the feedback loop.
Linkerd is a service mesh built on a layer-7 proxy; we use it as an abstraction layer for communication between the components in our system. It also moves common logic, such as timeouts and retries, out of our code and into a central, configurable control plane. This lets the system make a dynamic decision whenever service A tries to communicate with service B. The decision may be about enforcing a security policy or, as in our case, routing traffic to a specific service.
Read more about service-mesh here
Flux is a GitOps system (a CNCF project) that keeps our cluster configuration and deployments in sync across multiple environments through a simple Git repository. GitOps preserves the workflow you already know, such as code reviews, to streamline the process.
Flux can also watch container registries for new images and deploy them automatically.
Note: we use Flux for deployments, but any other way of triggering deployments on Kubernetes will give the same outcome together with Flagger.
Flagger is a Kubernetes operator that automates the promotion of canary deployments with progressive traffic shifting. It can shift traffic through several proxies and service meshes, such as Linkerd, Istio, and NGINX.
Flagger also runs a canary analysis before each promotion step, using Prometheus and other metrics sources, including webhooks. The analysis can run tests and check for an elevated error rate, high latency, and any other requirement you consider part of a healthy deployment. These indicators help Flagger decide whether to continue the promotion or trigger a rollback.
We use Prometheus as our main monitoring system, which also plays a huge role in our canary-analysis process. Flagger will query it periodically to gain insights about the service it’s trying to promote.
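To make the promote-or-rollback decision more concrete, here is a deliberately simplified model of such an analysis loop. It is a sketch of the idea, not Flagger's actual implementation: the function name `simulate_canary` and the way results are passed in are hypothetical, and each "pass" or "fail" stands for the outcome of one analysis interval.

```sh
# Simplified model of a progressive canary analysis (not Flagger's real code).
# Usage: simulate_canary STEP_WEIGHT MAX_WEIGHT FAIL_THRESHOLD RESULT...
# Each RESULT is "pass" or "fail" and represents one analysis interval.
simulate_canary() {
  local step="$1" max="$2" threshold="$3"
  shift 3
  local weight=0 failed=0 result
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      failed=$((failed + 1))
      if [ "$failed" -ge "$threshold" ]; then
        echo "rollback at weight ${weight}% after ${failed} failed checks"
        return 0
      fi
      continue  # hold the current weight and check again next interval
    fi
    # A passing check shifts another slice of traffic to the canary.
    weight=$((weight + step))
    if [ "$weight" -ge "$max" ]; then
      echo "promote after reaching ${max}%"
      return 0
    fi
  done
  echo "in progress at weight ${weight}% with ${failed} failed checks"
}

simulate_canary 25 50 2 pass fail fail
```

In this model a failed check never moves traffic forward, and enough failures end the rollout, which is the behaviour the analysis step is meant to guarantee.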
# install linkerd on the target cluster
linkerd install | kubectl apply -f -
# kustomization.yaml
---
namespace: flux
bases:
  - github.com/fluxcd/flux//deploy
patchesStrategicMerge:
  - patch.yaml
# patch.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flux
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
    spec:
      containers:
        - name: flux
          args:
            - --listen-metrics=:3031
            - --manifest-generation=true
            - --memcached-hostname=memcached.flux
            - --memcached-service=
            - --ssh-keygen-dir=/var/fluxd/keygen
            - --ssh-keygen-bits=521
            - --ssh-keygen-type=ed25519
            - --git-url=git@github.com:<orgName>/<kubernetes-config-repo>
            - --git-branch=master
            - --git-path=production
            - --git-user=flux
            - --git-poll-interval=5m
            - --sync-interval=5m
            - --sync-timeout=2m
            - --sync-garbage-collection=true
# take the generated SSH public key and add it as a deploy key in your GitHub repo
# (the flux-git-deploy secret holds the private key, so ask Flux for the public half)
fluxctl identity --k8s-fwd-ns flux | pbcopy
How you set up the repo is up to you. We use kustomize for manifest generation, but you can use Helm or other tools instead.
/.flux.yaml
/base
  /podinfo
/production
  /podinfo
# .flux.yaml
---
version: 1
patchUpdated:
  generators:
    - command: kubectl kustomize .
  patchFile: flux-patch.yaml
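Flux runs the generator command with the --git-path directory as its working directory, so you can preview locally what it will apply before pushing. A minimal sketch, assuming a kubectl with built-in kustomize support on your PATH; `render_manifests` is a hypothetical helper name, not part of Flux:

```sh
# Render the manifests the same way the .flux.yaml generator does,
# using the given directory (for example "production") as working directory.
render_manifests() {
  local dir="$1"
  ( cd "$dir" && kubectl kustomize . )
}

# Example (run from the repo root):
#   render_manifests production | less
```

Reviewing this output in a pull request is a cheap way to catch a broken patch before Flux syncs it.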
kubectl apply -k github.com/weaveworks/flagger//kustomize/linkerd?ref=0.23.0
Now for the really fun part:
We will run the podinfo container and then update it to test the canary rollout.
kubectl create ns test
kubectl annotate namespace test linkerd.io/inject=enabled
kubectl apply -k github.com/weaveworks/flagger//kustomize/tester
# production/kustomization.yaml
---
namespace: test
bases:
  - github.com/weaveworks/flagger//kustomize/podinfo
resources:
  - podinfo/canary.yaml
patchesStrategicMerge:
  - podinfo/patch.yaml
# production/podinfo/patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
spec:
  template:
    spec:
      containers:
        - name: podinfod
          image: stefanprodan/podinfo:3.1.0
# production/podinfo/canary.yaml
apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  service:
    # ClusterIP port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
  canaryAnalysis:
    # schedule interval (default 60s)
    interval: 30s
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # Linkerd Prometheus checks
    metrics:
      - name: request-success-rate
        # minimum req success rate (non 5xx responses)
        # percentage (0-100)
        threshold: 99
        interval: 1m
      - name: request-duration
        # maximum req duration P99
        # milliseconds
        threshold: 500
        interval: 30s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary.test:9898/token | grep token"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 http://podinfo-canary.test:9898/"
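As a rough sanity check on these settings: the Flagger documentation describes the minimum promotion time as interval × (maxWeight / stepWeight) and the rollback time as interval × threshold. A small sketch of that arithmetic with the values from canary.yaml (the variable names are mine and purely illustrative):

```sh
# Values copied from production/podinfo/canary.yaml
interval_s=30    # canaryAnalysis.interval, in seconds
step_weight=5    # canaryAnalysis.stepWeight
max_weight=50    # canaryAnalysis.maxWeight
threshold=5      # canaryAnalysis.threshold

# Number of traffic increments needed to reach maxWeight
steps=$(( max_weight / step_weight ))
# Lower bound on a fully healthy rollout
min_promotion_s=$(( interval_s * steps ))
# Time until rollback when every check fails
rollback_s=$(( interval_s * threshold ))

echo "traffic steps: ${steps}"
echo "minimum time to promote: ${min_promotion_s}s"
echo "time to roll back on sustained failures: ${rollback_s}s"
```

With these settings a healthy rollout takes at least five minutes, and a consistently failing one is rolled back after about two and a half. Tune interval and stepWeight together if that is too slow or too aggressive for your service.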
Now you can commit and push the changes, then watch for new resources in the test namespace:
kubectl get all --namespace=test
Once the rollout has completed, you can deploy a new version by committing the following change:
# production/podinfo/patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
spec:
  template:
    spec:
      containers:
        - name: podinfod
          image: stefanprodan/podinfo:3.1.1
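After Flux has synced the new image tag, you will probably want to follow the rollout. A minimal sketch, assuming the Canary resource reports its state in a `.status.phase` field, as it did in the Flagger versions I have worked with; `canary_phase` is a hypothetical helper name:

```sh
# Print the phase Flagger reports for a Canary resource.
canary_phase() {
  local namespace="$1" name="$2"
  kubectl -n "$namespace" get canary "$name" -o jsonpath='{.status.phase}'
}

# Example:
#   canary_phase test podinfo
# For the full event history of the analysis, use:
#   kubectl -n test describe canary/podinfo
```

The describe output lists each weight change and failed check as an event, which is usually the quickest way to see why a rollout stalled or was rolled back.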
Summary, Tips & Tricks
When introducing this approach into an existing system, you will need to plan a gradual rollout. There is a natural order of preconditions that have to be fulfilled:
- The ingress controller should call components through the service mesh, or be able to handle traffic splitting without it.
- Traffic splitting happens in the originating client, so it has to be configured at the origin rather than at the receiving end. With a service mesh this is transparent, but note that client-facing services need a supported proxy or service mesh that understands traffic splitting.
- Prometheus: we can take “health metrics” from the service mesh, or make more informed decisions by having each component expose its own in-depth health metrics.
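To illustrate the kind of "health metric" the service mesh can provide, here is a sketch that builds a PromQL expression for the request success rate of a deployment. The metric name `response_total` and its labels are my understanding of Linkerd's proxy metrics, and `success_rate_query` is a hypothetical helper; treat all of them as assumptions and verify them against your own Prometheus before relying on the numbers.

```sh
# Build a PromQL expression for the percentage of non-failed inbound requests.
# Metric and label names are assumptions about Linkerd's proxy metrics.
success_rate_query() {
  local namespace="$1" deployment="$2" window="$3"
  local selector="namespace=\"${namespace}\",deployment=\"${deployment}\",direction=\"inbound\""
  printf 'sum(rate(response_total{%s,classification!="failure"}[%s])) / sum(rate(response_total{%s}[%s])) * 100\n' \
    "$selector" "$window" "$selector" "$window"
}

# Example, assuming Prometheus is reachable at this (hypothetical) address:
#   curl -sG http://linkerd-prometheus.linkerd:9090/api/v1/query \
#     --data-urlencode "query=$(success_rate_query test podinfo 1m)"
```

A component that exports its own metrics can be checked the same way, by swapping in a query over those series instead of the mesh-level ones.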