Using Helm with Blackbox

Posted on Sep 1, 2023 4 mins read


Overview


In most cases we can monitor things just by using simple tools that already exist, like k8s event logs, or by checking the number of running pods in a deployment and alerting on < 1. That would indicate that we have 0 running pods in a deployment or stateful set. In some cases we want to do more specific things, like looking at endpoint status or synthetic testing.
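
As a concrete example of that simple style of check, something like Telegraf's kube_inventory input can surface deployment replica counts to alert on. This is a sketch, an assumption on my part rather than the exact config used in the chart described below:

[[inputs.kube_inventory]]
  # Talk to the in-cluster API server using the default service account token
  url = "https://kubernetes.default.svc"
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  # Only collect deployment and statefulset objects; "" means all namespaces
  resource_include = [ "deployments", "statefulsets" ]
  namespace = ""

# The resulting kubernetes_deployment measurement carries a replicas_available field,
# so an alert on replicas_available < 1 covers the "0 running pods" case above.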

In our case we’re using a shared_cluster, which is a k8s platform maintained by another team. This shared_cluster is heavily locked down around security, which means that by default all endpoints created by Istio Virtual Services (in most cases AWS NLB -> k8s services) are protected by both OAuth and a whitelisted firewall.

The use case for this work was to create a monitoring system which could alert on a few important things about each of our running services while keeping in mind the highly secure nature of the cluster:

  • DNS entries are configured correctly
    • This is usually out of scope for most monitoring systems
  • Service endpoints are protected by both whitelisted firewall rules and OAuth
  • Specific endpoints are whitelisted on the firewall but have no OAuth protection
  • Endpoints are returning the expected responses for both HTTP and gRPC
  • HTTP endpoints properly redirect to HTTPS
  • TLS is properly configured on all services

Example

Workflow:

  • Create a deployment which runs Telegraf alongside the bb-exporter and pg-exporter (a sketch follows this list)
  • Telegraf scrapes the pg-exporter and bb-exporter ports for data
  • Telegraf exposes a /metrics endpoint, which is scraped by our WaveFront collector and sent to WaveFront
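
A minimal sketch of that deployment is below. The names and images are assumptions rather than the actual chart output; the ports are the usual defaults (9115 for the Blackbox exporter, 9187 for postgres_exporter, 9273 for Telegraf's prometheus_client output).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bb-monitoring                  # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bb-monitoring
  template:
    metadata:
      labels:
        app: bb-monitoring
    spec:
      containers:
        - name: bb-exporter
          image: prom/blackbox-exporter:latest                  # image/tag are assumptions
          ports:
            - containerPort: 9115      # probe endpoint Telegraf scrapes on localhost
        - name: pg-exporter
          image: prometheuscommunity/postgres-exporter:latest   # assumption
          ports:
            - containerPort: 9187
        - name: telegraf
          image: telegraf:latest                                # assumption
          ports:
            - containerPort: 9273      # /metrics endpoint scraped by the WaveFront collector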

Module: OAuth blocked

In this case what we want to do is ensure that an endpoint is blocking requests that carry no OAuth information. This can help us ensure that our endpoints are properly set up for OAuth. The use case here is that someone at some point might accidentally remove or alter the OAuth config and cause the authentication to stop working, leaving our endpoints exposed without any protection.

This could happen for a variety of reasons, but it’s always the reasons we don’t see coming that do the most harm.

This is our safety valve to make sure that we’re constantly testing and validating our authentication status.

The Telegraf input points at the Blackbox exporter's probe endpoint:

[[inputs.prometheus]]
urls = [ "http://localhost:9115/probe?module=https_oauth_blocked&target=https://css-myapp-dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.w3n.io"]

The https_oauth_blocked module itself is defined in the Blackbox exporter config:

      https_oauth_blocked:
        prober: http
        timeout: 5s
        http:
          valid_http_versions: [ "HTTP/1.1", "HTTP/2.0" ]
          valid_status_codes: [ 302 ]
          method: GET
          no_follow_redirects: true
          fail_if_ssl: false
          fail_if_not_ssl: true
          fail_if_header_not_matches:
            - header: location
              regexp: 'login.windows.net'
          tls_config:
            insecure_skip_verify: false
          preferred_ip_protocol: "ip4"
          ip_protocol_fallback: false

All pages protected with OAuth validation will return a 302 to login.windows.net.
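
To spot-check the module by hand, you can hit the Blackbox exporter's /probe endpoint directly and look at the metrics it emits; probe_success is the value the alerting ultimately keys off of. The target below is just the example hostname from the Telegraf config above.

# Run the module once against the target and pull out the interesting metrics
curl -s 'http://localhost:9115/probe?module=https_oauth_blocked&target=https://css-myapp-dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.w3n.io' \
  | grep -E '^probe_(success|http_status_code)'

# Expected when the endpoint is correctly protected (302 redirect to login.windows.net):
#   probe_http_status_code 302
#   probe_success 1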

DNS checks

In this case we are checking more than just endpoints. Here we’re validating that our DNS records are set up, and continue to be set up, correctly. This is like a constantly running validation of our externally facing system.

This is another safety check to make sure that our systems stay configured the way we expect them to be configured.

It’s not impossible for something to get knocked out of alignment somewhere. It has happened, and this is what “reliability” means in SRE.

      css.dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.mydomain.com:
        prober: dns
        timeout: 5s
        dns:
          query_name: "css.dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.mydomain.com"
          query_type: "A"
          transport_protocol: 'udp'
          preferred_ip_protocol: "ip4"

This will check for a DNS entry that is configured as an A record.
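
The Telegraf side mirrors the OAuth example. One subtlety with dns probes: the target in the probe URL is the DNS server to query, while query_name in the module is the name being resolved. The resolver address below is an assumption; in practice it would be whatever resolver serves that zone.

[[inputs.prometheus]]
urls = [ "http://localhost:9115/probe?module=css.dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.mydomain.com&target=8.8.8.8" ]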

In some cases we’ve had people outright remove the Istio Virtual Service with a bad change to the helm chart which knocked this out of existence for a bit. Not a huge deal, but we were able to detect the change quickly and push out a fix once we got the alarm.

Conclusion

Early alarming like this is critical when we want to enable developers to move fast and build things.

I believe the role of DevOps and/or SRE is to build things like this to catch the common failures, alert, and remediate quickly. This stands apart from other modes that might use something like a change management process to gate and review every possible change that goes to production.

I will always lean toward less paperwork and more automation, because moving at the speed of business requires this kind of agile thinking.

Tags: helm, automation, blackbox
