Posted on Sep 1, 2023 · 4 mins read
In most cases we can monitor things with simple tools that already exist, like k8s event logs, or by checking the number of running pods in a deployment and alerting when the count drops below 1. This would indicate that we have 0 running pods in a deployment or stateful set. In some cases we want to do something more specific, like looking at endpoint status or synthetic testing.
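As a sketch, that "zero running pods" check can be expressed as a Prometheus alerting rule. This assumes kube-state-metrics is installed in the cluster; the deployment name and thresholds below are placeholders, not from this article:

```yaml
groups:
  - name: deployment-health
    rules:
      - alert: DeploymentHasNoRunningPods
        # kube-state-metrics exposes per-deployment available-replica counts;
        # firing when the count drops below 1 means zero pods are serving
        expr: kube_deployment_status_replicas_available{deployment="myapp"} < 1
        for: 5m
        labels:
          severity: critical
```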
In our case we’re using a shared_cluster, which is a k8s platform maintained by another team. This shared_cluster is built with a heavy emphasis on security, which means by default all endpoints created by Istio Virtual Services (in most cases AWS NLB -> k8s services) are protected by both OAuth and a whitelisted firewall.
The use case for this work was to create a monitoring system which could alert on a few important things about each of our running services, while keeping in mind the highly secure nature of the cluster:

- OAuth enforcement on our http and gRPC endpoints
- http endpoints properly redirected to https
- DNS records for our services

Workflow: probe results are exposed on a /metrics endpoint which is scraped by our WaveFront collector and sent to WaveFront.

In this case what we want to do is ensure that an endpoint is blocking requests that carry no OAuth information. This helps us ensure that our endpoints are properly set up for OAuth. The use case here is that someone at some point might accidentally remove or alter the OAuth config and cause the authentication to stop working, leaving our endpoints exposed without any protection.
This could happen for a variety of reasons, but it’s always the reasons we don’t see coming that do the most harm.
This is our safety valve to make sure that we’re constantly testing and validating our authentication status.
# Telegraf scrapes the blackbox exporter's /probe endpoint; the module and
# target are passed as query parameters
[[inputs.prometheus]]
  urls = [ "http://localhost:9115/probe?module=https_oauth_blocked&target=https://css-myapp-dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.w3n.io" ]
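For completeness, a minimal sketch of the output side of that Telegraf pipeline, forwarding the scraped probe metrics to a WaveFront proxy. The proxy address here is a placeholder, not from this article:

```toml
# Hypothetical output stanza: ship collected metrics to a WaveFront proxy
[[outputs.wavefront]]
  url = "http://wavefront-proxy.monitoring.svc:2878"
```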
https_oauth_blocked:
  prober: http
  timeout: 5s
  http:
    valid_http_versions: [ "HTTP/1.1", "HTTP/2.0" ]
    valid_status_codes: [ 302 ]
    method: GET
    no_follow_redirects: true
    fail_if_ssl: false
    fail_if_not_ssl: true
    fail_if_header_not_matches:
      - header: location
        regexp: 'login.windows.net'
    tls_config:
      insecure_skip_verify: false
    preferred_ip_protocol: "ip4"
    ip_protocol_fallback: false
All pages protected with OAuth validation will return a 302 redirect to login.windows.net.
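The module above encodes a simple invariant. As an illustrative sketch (this is not the exporter's actual code, just the condition it enforces), the check reduces to:

```python
def oauth_redirect_ok(status_code: int, location: str) -> bool:
    """Mirror of the blackbox module's logic: an unauthenticated request
    must be answered with a 302 whose Location header points at the
    identity provider (login.windows.net)."""
    return status_code == 302 and "login.windows.net" in location

# A protected endpoint redirects anonymous requests to the identity provider:
assert oauth_redirect_ok(302, "https://login.windows.net/common/oauth2/authorize")
# A misconfigured endpoint that serves content directly fails the probe:
assert not oauth_redirect_ok(200, "")
```

If the OAuth config is removed, the endpoint starts answering 200 directly and the probe flips to failing, which is exactly the alarm we want.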
In this case we are checking more than just endpoints. Here we’re validating that our DNS records are set up, and continue to be set up, correctly. This is like a constantly running validation of our externally facing system.
This is another safety check to make sure that our systems stay configured the way we expect them to be configured.
It’s not impossible for something to get knocked out of alignment somewhere; it has happened, and catching it is what “reliability” means in SRE.
css.dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.mydomain.com:
  prober: dns
  timeout: 5s
  dns:
    query_name: "css.dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.mydomain.com"
    query_type: "A"
    transport_protocol: 'udp'
    preferred_ip_protocol: "ip4"
This will check for a DNS entry configured as an A record.
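The same kind of A-record lookup can be sketched in a few lines of Python, mirroring the prober's query_type: "A" / preferred_ip_protocol: "ip4" settings. The localhost lookup at the end is only there so the sketch is runnable; the real probe targets the external record from the config above:

```python
import socket

def resolve_a_records(hostname: str) -> list[str]:
    """Resolve the IPv4 A records for a hostname using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry's sockaddr is (ip, port); collect the distinct IPs
    return sorted({info[4][0] for info in infos})

# Placeholder lookup so the sketch runs anywhere:
assert "127.0.0.1" in resolve_a_records("localhost")
```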
In some cases we’ve had people outright remove the Istio Virtual Service with a bad change to the helm chart which knocked this out of existence for a bit. Not a huge deal, but we were able to detect the change quickly and push out a fix once we got the alarm.
Early alarming like this is critical when we want to enable developers to move fast and build things.
I believe the role of DevOps and/or SRE is to build things like this to catch the common failures, alert, and remediate quickly. This stands apart from other modes that might use something like a change management process to gate and review every possible change that goes to production.
I will always lean toward less paperwork and more automation, because moving at the speed of business requires this kind of agile thinking.
Tags: helm, automation, blackbox