Kubernetes Native Continuous Delivery with FluxCD, Flagger and Linkerd

Why have so many software development organizations adopted agile in recent years? Probably because it aligns R&D much more closely with ever-changing business goals. Agile lets you deliver new functionality fast, in small iterations that are easier to review, test and validate automatically. That reduces the overall risk of each release and takes the human factor and manual gating out of the equation.

In short, it allows us to iterate fast without breaking stuff. Building a continuous delivery pipeline with automated gates, checks and rollback ability builds confidence, lets us innovate faster and creates a tighter feedback loop.

Key Components

Continuous Delivery architecture

Linkerd

Linkerd is a layer-7 proxy used as an abstraction layer for communication between the components in our system. It also moves common logic such as timeouts and retries out of our code and into a central, configurable control plane. This allows the system to make a decision dynamically whenever service A tries to communicate with service B. The decision may be about enforcing a security policy or, as in our case, routing traffic to a specific service.

Read more about service-mesh here

FluxCD

Flux is a GitOps system (a CNCF project) that helps us keep our cluster configs and deployments in sync across multiple environments through a simple git repository. GitOps preserves the workflow you are already familiar with, such as code reviews, to streamline the process.
Flux can also poll a container registry for new images and deploy them automatically.

Note: we use FluxCD for deployments but you can use any other way to trigger deployments on Kubernetes and it will have the same outcome together with Flagger.

Using GitOps for continuous delivery

Flagger

Flagger is a Kubernetes operator that automates the promotion of canary deployments with progressive traffic shifting. It can use a number of proxies for traffic shifting, such as Linkerd, Istio and NGINX.
Flagger also runs a canary analysis before each promotion step, using Prometheus and other metric sources, including webhooks. The analysis can run tests and check for an elevated error rate, high latency and any other requirement you consider necessary for a healthy deployment. These indicators help Flagger decide whether to continue the promotion or trigger a rollback.
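
To make the promote-or-rollback decision concrete, here is a minimal sketch of such a loop as a shell function. It is a simplified model, not Flagger's actual implementation: the function name `simulate_rollout` and its arguments are hypothetical, and the numbers mirror the canary settings used later in this post (5% steps, 50% maximum weight, 5 tolerated failures).

```shell
#!/bin/sh
# Simplified model of a canary promotion loop. This is NOT Flagger's code,
# only an illustration of the promote-or-rollback decision.
# usage: simulate_rollout STEP MAX THRESHOLD RESULT...
#   each RESULT is the outcome of one analysis run: "pass" or "fail"
simulate_rollout() {
  step=$1; max=$2; threshold=$3
  shift 3
  weight=0; failed=0
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      failed=$((failed + 1))
      if [ "$failed" -ge "$threshold" ]; then
        echo "rollback at ${weight}% after ${failed} failed checks"
        return 0
      fi
      continue  # keep the current weight and check again next interval
    fi
    # a passing check shifts one more step of traffic to the canary
    weight=$((weight + step))
    if [ "$weight" -ge "$max" ]; then
      echo "promote: canary reached ${weight}%"
      return 0
    fi
  done
  echo "in progress at ${weight}%"
}

simulate_rollout 5 50 5 pass pass pass pass pass pass pass pass pass pass
simulate_rollout 5 50 5 pass pass fail fail fail fail fail
```

In this model, ten passing checks in a row end in a promotion, while two passes followed by five failures end in a rollback with the canary still at 10% of the traffic.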

Prometheus

We use Prometheus as our main monitoring system, which also plays a huge role in our canary-analysis process. Flagger will query it periodically to gain insights about the service it’s trying to promote.
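
As a rough idea of what such a periodic query can look like, the sketch below builds a PromQL success-rate expression over Linkerd's `response_total` proxy metric. The helper name `success_rate_query`, the label selection and the exact query shape are assumptions for illustration and may differ from what Flagger sends; `PROM_URL` in the comment is a placeholder for your own Prometheus address.

```shell
#!/bin/sh
# Build a PromQL expression for the percentage of non-failed inbound requests.
# The label names and query shape are assumptions modelled on Linkerd's proxy
# metrics; verify them against your own Prometheus before relying on them.
# usage: success_rate_query NAMESPACE DEPLOYMENT WINDOW
success_rate_query() {
  ns=$1; deploy=$2; window=$3
  sel="namespace=\"${ns}\",deployment=\"${deploy}\",direction=\"inbound\""
  printf 'sum(rate(response_total{%s,classification!="failure"}[%s])) / sum(rate(response_total{%s}[%s])) * 100\n' \
    "$sel" "$window" "$sel" "$window"
}

success_rate_query test podinfo 1m

# To evaluate it against a live Prometheus (PROM_URL is a placeholder):
#   curl -sG "$PROM_URL/api/v1/query" \
#     --data-urlencode "query=$(success_rate_query test podinfo 1m)"
```

The result is a percentage, which is the kind of value a threshold such as "at least 99" can be compared against.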

Config Snippets

Linkerd (Service-Mesh)

# install linkerd on the target cluster
linkerd install | kubectl apply -f -

Flux

# kustomization.yaml
---
namespace: flux
bases:
  - github.com/fluxcd/flux//deploy
patchesStrategicMerge:
  - patch.yaml
# patch.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flux
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
    spec:
      containers:
        - name: flux
          args:
            - --listen-metrics=:3031
            - --manifest-generation=true
            - --memcached-hostname=memcached.flux
            - --memcached-service=
            - --ssh-keygen-dir=/var/fluxd/keygen
            - --ssh-keygen-bits=521
            - --ssh-keygen-type=ed25519
            - --git-url=git@github.com:<orgName>/<kubernetes-config-repo>
            - --git-branch=master
            - --git-path=production
            - --git-user=flux
            - --git-poll-interval=5m
            - --sync-interval=5m
            - --sync-timeout=2m
            - --sync-garbage-collection=true
# the flux-git-deploy secret holds the private key, so print the matching
# public key with fluxctl and add it as a deploy key in your GitHub repo.
fluxctl identity --k8s-fwd-ns flux | pbcopy

Repository

How you set up the repo is up to you. We use kustomize for manifest generation, but you can use Helm or other tools instead.

/.flux.yaml
/base
  /podinfo
/production
  /podinfo
# .flux.yaml
---
version: 1
patchUpdated:
  generators:
    - command: kubectl kustomize .
  patchFile: flux-patch.yaml

Flagger

kubectl apply -k github.com/weaveworks/flagger//kustomize/linkerd?ref=0.23.0

Now to the real fun part:

We will run the podinfo container and then update it to test the canary rollout.

kubectl create ns test
kubectl annotate namespace test linkerd.io/inject=enabled
kubectl apply -k github.com/weaveworks/flagger//kustomize/tester
# production/kustomization.yaml
---
namespace: test
bases:
  - github.com/weaveworks/flagger//kustomize/podinfo
resources:
  - podinfo/canary.yaml
patchesStrategicMerge:
  - podinfo/patch.yaml
# production/podinfo/patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
spec:
  template:
    spec:
      containers:
      - name: podinfod
        image: stefanprodan/podinfo:3.1.0
# production/podinfo/canary.yaml
apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  service:
    # ClusterIP port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
  canaryAnalysis:
    # schedule interval (default 60s)
    interval: 30s
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # Linkerd Prometheus checks
    metrics:
    - name: request-success-rate
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      threshold: 99
      interval: 1m
    - name: request-duration
      # maximum req duration P99
      # milliseconds
      threshold: 500
      interval: 30s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary.test:9898/token | grep token"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 http://podinfo-canary.test:9898/"
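
As a rough guide to how long this configuration takes, the rollout time can be estimated from the analysis settings: the minimum promotion time is about interval * (maxWeight / stepWeight), and a failing canary is rolled back after about interval * threshold. The short script below works through that arithmetic for the values in canary.yaml above; the variable names are just illustrative, and real rollouts take longer whenever checks fail or webhooks are slow.

```shell
#!/bin/sh
# Values copied from the canaryAnalysis section of canary.yaml.
interval=30      # seconds between analysis runs
step_weight=5    # percent of traffic added per successful step
max_weight=50    # percent at which the canary is promoted
threshold=5      # failed checks tolerated before rollback

steps=$((max_weight / step_weight))        # number of traffic steps
min_promotion=$((interval * steps))        # best case, every check passes
rollback_after=$((interval * threshold))   # every check fails

echo "steps=${steps} min_promotion=${min_promotion}s rollback_after=${rollback_after}s"
```

With these settings a healthy release needs 10 steps and at least 300 seconds before promotion, and a consistently failing one is rolled back after roughly 150 seconds.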

Now you can commit and push the changes, then watch for new resources in the test namespace.

kubectl get all --namespace=test

Once the rollout has completed, you can deploy a new version by committing the following.

# production/podinfo/patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
spec:
  template:
    spec:
      containers:
      - name: podinfod
        image: stefanprodan/podinfo:3.1.1

Summary, Tips & Tricks

When implementing this approach on an existing system, you will need to plan how to roll it out gradually. There is a natural order of preconditions that have to be fulfilled.

  1. The ingress controller should call components via the service mesh, or be able to understand traffic splitting without it.
  2. Traffic splitting happens in the originating client, so it needs to be configured at the origin and not at the receiving end. If you use a service mesh this is transparent, but note that client-facing services will need a supported proxy (1) or a service mesh (2) that understands traffic splitting.
  3. Prometheus: we can take “health metrics” from the service mesh, or we can make more intelligent decisions by having each component deliver its own in-depth health metrics.

1: See supported proxies on Flagger’s website

2: Check the docs on how to inject linkerd into your ingress
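
To illustrate the traffic splitting mentioned in tip 2: with Linkerd, the split is expressed as an SMI TrafficSplit object that the meshed client's proxy applies when it calls the podinfo service. The sketch below prints roughly what such an object looks like for a given canary weight. The helper `render_traffic_split` is hypothetical and the manifest is an approximation for illustration; Flagger creates and updates the real object itself, so you would not apply this by hand.

```shell
#!/bin/sh
# Print an approximate TrafficSplit manifest for a given canary weight (0-100).
# Illustration only: Flagger manages the real object during a rollout.
render_traffic_split() {
  canary=$1
  primary=$((100 - canary))
  cat <<EOF
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: podinfo
  namespace: test
spec:
  # the service that clients call
  service: podinfo
  backends:
    - service: podinfo-primary
      weight: ${primary}
    - service: podinfo-canary
      weight: ${canary}
EOF
}

render_traffic_split 5
```

Because the calling side's proxy is what honours these weights, a client that is not meshed (for example an ingress controller without the Linkerd proxy) sends all of its traffic to wherever the plain service points.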