helm automation blackbox
In most cases we can monitor things just by using simple tools that already exist, like k8s event logs, or by checking the number of running pods on a deployment and alerting when it drops below 1. That would indicate we have zero running pods in a deployment or stateful set. In some cases we want to do more specific things like looking at endpoint status or synthetic testing.
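As a rough sketch of that kind of pod-count check (the deployment name myapp and the alert wiring are hypothetical, not something from our setup), something like this can run on a schedule:

#!/bin/bash
# Hypothetical check: alert when a deployment has fewer than 1 ready replica.
# Assumes kubectl is already pointed at the right cluster and namespace.
READY=$(kubectl get deployment myapp -o jsonpath='{.status.readyReplicas}')
READY=${READY:-0}   # jsonpath comes back empty when nothing is ready

if [ "$READY" -lt 1 ]; then
  echo "ALERT: myapp has no running pods"
fi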
In our case we're using a shared_cluster, which is a k8s platform maintained by another team. This shared_cluster is heavily focused on security, which means by default all endpoints created by Istio Virtual Services (in most cases AWS NLB -> k8s services) are protected by both OAuth and a whitelisted firewall.
The use case for this work was to create a monitoring system which could alert on a few important things about each of our running services while keeping in mind the highly secure nature of the cluster:

OAuth protection on our http and gRPC endpoints
http endpoints properly redirected to https
DNS records for our externally facing services

Workflow: the probe results are exposed on a /metrics endpoint which is scraped by our WaveFront collector and sent to WaveFront.

In this case what we want to do is ensure that an endpoint is blocking requests that have no OAuth information. This helps us confirm that our endpoints are properly set up for OAuth. The use case here is that someone at some point might accidentally remove or alter the OAuth config and cause the authentication to stop working, leaving our endpoints exposed without any protection.
This could happen for a variety of reasons, but it’s always the reasons we don’t see coming that do the most harm.
This is our safety valve to make sure that we’re constantly testing and validating our authentication status.
# Telegraf prometheus input pointed at the blackbox exporter probe for the OAuth check
[[inputs.prometheus]]
  urls = [ "http://localhost:9115/probe?module=https_oauth_blocked&target=https://css-myapp-dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.w3n.io"]
https_oauth_blocked:
  prober: http
  timeout: 5s
  http:
    valid_http_versions: [ "HTTP/1.1", "HTTP/2.0" ]
    valid_status_codes: [ 302 ]
    method: GET
    no_follow_redirects: true
    fail_if_ssl: false
    fail_if_not_ssl: true
    fail_if_header_not_matches:
      - header: location
        regexp: 'login.windows.net'
    tls_config:
      insecure_skip_verify: false
    preferred_ip_protocol: "ip4"
    ip_protocol_fallback: false
All pages protected with OAuth validation will return a 302 to login.windows.net.
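A quick way to sanity check the module by hand is to hit the blackbox exporter probe endpoint directly, using the same URL the Telegraf input above scrapes; probe_success should come back as 1 when the 302 check passes:

# Exercise the probe the same way Telegraf does and look at the result.
curl -s "http://localhost:9115/probe?module=https_oauth_blocked&target=https://css-myapp-dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.w3n.io" | grep probe_success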
In this case we are checking more than just endpoints. Here we're validating that our DNS records are set up, and continue to be set up, correctly. This is like a constantly running validation of our externally facing system. This is another safety check to make sure that our systems stay configured the way we expect them to be configured. It's not impossible for something to get knocked out of alignment somewhere. It has happened, and catching it quickly is what "reliability" means in SRE.
css.dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.mydomain.com:
  prober: dns
  timeout: 5s
  dns:
    query_name: "css.dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.mydomain.com"
    query_type: "A"
    transport_protocol: 'udp'
    preferred_ip_protocol: "ip4"
This will check that a DNS entry is configured as an A record.
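For a quick manual spot check of the same record outside of the exporter, a plain dig works:

# Resolve the same A record the dns prober is watching.
dig +short A css.dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.mydomain.com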
In some cases we’ve had people outright remove the Istio Virtual Service with a bad change to the helm chart which knocked this out of existence for a bit. Not a huge deal, but we were able to detect the change quickly and push out a fix once we got the alarm.
Early alarming like this is critical when we want to enable developers to move fast and build things.
I believe the role of DevOps and/or SRE is to build things like this to catch the common failures, alert, and remediate quickly. This stands apart from other modes that might use something like a change management process to gate and review every possible change that goes to production.
I will always lean toward less paperwork and more automation, because moving at the speed of business requires this kind of agile thinking.
terraform gitlab cicd
Using terraform workspaces with a little python code and gitlab cicd pipelines to create a dynamic, interesting pipeline.
Company: Renovo
We wanted to create a single terraform manifest that could express complicated infrastructure requirements with environmental deltas: a single, simple implementation, with each environment expressing its own uniqueness based on its needs.
➜ raleigh git:(master) terraform workspace list
* default
➜ raleigh git:(master) terraform workspace new stage
Created and switched to workspace "stage"!
You're now on a new, empty workspace. Workspaces isolate their state,
so if you run "terraform plan" Terraform will not see any existing state
for this configuration.
➜ raleigh git:(master) terraform workspace new production
Created and switched to workspace "production"!
You're now on a new, empty workspace. Workspaces isolate their state,
so if you run "terraform plan" Terraform will not see any existing state
for this configuration.
➜ raleigh git:(master) terraform workspace list
default
* production
stage
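To tie this into the gitlab cicd side, here is a hedged sketch of what a pipeline step could look like. The ENVIRONMENT variable and the envs/*.tfvars layout are assumptions for illustration, not part of the original setup:

#!/bin/bash
# Hypothetical CI step: select (or create) the workspace for this environment,
# then plan against an environment-specific tfvars file.
ENVIRONMENT=${ENVIRONMENT:-stage}

terraform workspace select "$ENVIRONMENT" 2>/dev/null || terraform workspace new "$ENVIRONMENT"
terraform plan -var-file="envs/${ENVIRONMENT}.tfvars" -out=plan.tfplan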
aws jq
Handy tricks for pulling out AWS data. From time to time I find myself needing some quick and dirty reports detailing aspects of a given AWS account. These helper scripts have helped me in my consulting work, as no two clients are ever using the same set of tools, and as a consultant I often find myself restrained by security constraints.
In general most of my code (be it ruby or jq) uses some kind of simple cache mechanism. In most of these examples I'm making the aws cli call and writing the data to a file, then using that file with jq. This helps me tune and debug the queries as needed, and I can obviously reuse the data for different queries. Some day I might get around to making this a little more sophisticated, but for now I usually run the script once, then comment the line out and run it again as many times as needed.
This gives me a mapping of the instance types used. This is most often useful in environments where we're using mostly static instance builds without any clustering (EKS, ECS) or ASGs. This is a sure sign that the client does not have a clean way to deploy code to newly created instances, which is a huge red flag.
aws ec2 describe-instances --query "Reservations[].Instances[].{InstanceType:InstanceType}" > /tmp/instance_types.json
cat /tmp/instance_types.json | jq -r 'group_by(.InstanceType) | map({type: .[0].InstanceType, count: length}) | sort_by(.count)[] | [.type, .count|tostring] | join("\t")'
This is a great way to get a sense for the size and usage of EBS blocks.
#!/bin/bash
aws ec2 describe-volumes > /tmp/volumes.json
## Report total size of all ebs volumes
echo "---Total used---"
cat /tmp/volumes.json|jq '.Volumes|reduce .[].Size as $item (0; . + $item)'
## Report breakdown of types.
echo "---Types---"
cat /tmp/volumes.json|jq -r '.Volumes| group_by(.VolumeType)| map({volume: .[0].VolumeType, count: length}) | sort_by(.count)[] | [.volume, .count|tostring] | join("\t")'
## Report on size of volumes
echo "---Size (size in GiB / number )---"
cat /tmp/volumes.json|jq -r '.Volumes| group_by(.Size)| map({size: .[0].Size, count: length}) | sort_by(.size)[] | [.size, .count|tostring] | join("\t")'
echo "---Usage---"
cat /tmp/volumes.json|jq -r '.Volumes| group_by(.State)| map({state: .[0].State, count: length}) | sort_by(.count)[] | [.state, .count|tostring] | join("\t")'
I've seen cases where Lambda functions are forgotten about and the snapshots they create end up piling up over time. It's usually not a big deal since snapshots are cheap, but it's sometimes handy to know.
#!/bin/bash
aws ec2 describe-snapshots --owner-id=349250784145 > /tmp/snapshots.json
## Report total size of all snapshots
echo "---Total used---"
cat /tmp/snapshots.json|jq '.Snapshots|reduce .[].VolumeSize as $item (0; . + $item)'
## Report on size of snapshots
echo "---Size (size in GiB / count)---"
cat /tmp/snapshots.json|jq -r '.Snapshots| group_by(.VolumeSize)| map({size: .[0].VolumeSize, count: length}) | sort_by(.size)[] | [.size, .count|tostring] | join("\t")'
cloudformation
The idea of a relinker is pretty simple. It's basically an internal version of Google's URL shortener. The internal part is important because the whole point is keeping these links internal.
What we want to create at the end of this is the ability to use our local browser and hit http://go/ to get to the webui for relinker. Then, once we have a link loaded, we can simply type something like "go/s" in our browser and know that it'll go to wherever "s" is pointed at.
This works really well for local development, and in many ways it's basically just a weird way of creating a bizarre kind of bookmarking system that runs in the cloud. However, it gets really useful when we look at a few interesting aspects of why we would do this.
The trippy part about this is how we do the redirection. This document assumes that we have a macbook running VirtualBox, where VBox is running a debian VM (or similar). You could probably do the same thing with docker-engine or similar. Basically we have to hack our local dns on the macbook (via /etc/hosts) as such:
127.0.0.1 go
Then we have a docker container on the debian VM that redirects :80 to our cloudy instance. We'll get into how this is wired up later. We use the local redirect hack if we don't have IT buy-in quite yet. This is often the case when we want to prove things out before asking anyone else to commit to them.
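One way that redirect container could be wired up is a simple socat forwarder; the image and the target hostname here are placeholders for illustration, not the exact setup we used:

# Hypothetical forwarder on the debian VM: anything arriving on :80
# gets relayed on to the cloudy relinker instance.
docker run -d --name go-redir -p 80:80 \
  alpine/socat TCP-LISTEN:80,fork,reuseaddr TCP:relinker.example.internal:80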
cd web
./api.rb
Now we can play around with the UI.
amazon-ecs: Chef::Exceptions::Service
amazon-ecs: -------------------------
amazon-ecs: service[procps]: Service is not known to chkconfig.
amazon-ecs: Profile Summary: 48 successful controls, 4 control failures, 2 controls skipped
amazon-ecs: Test Summary: 117 successful, 6 failures, 2 skipped
amazon-ecs: × package-08: Install auditd (1 failed)
amazon-ecs: ✔ System Package audit should be installed
amazon-ecs: ✔ Audit Daemon Config log_file should cmp == "/var/log/audit/audit.log"
amazon-ecs: ✔ Audit Daemon Config log_format should cmp == "raw"
amazon-ecs: ✔ Audit Daemon Config flush should match /^INCREMENTAL|INCREMENTAL_ASYNC$/
amazon-ecs: × Audit Daemon Config max_log_file_action should cmp == "keep_logs"
amazon-ecs: Profile Summary: 52 successful controls, 0 control failures, 2 controls skipped
amazon-ecs: Test Summary: 123 successful, 0 failures, 2 skipped
cloudformation
Shatterdome is my attempt at teaching how to build a platform tool.
In the course of my career I've run into more than a few companies that are working on building their own internal platform tooling. In some cases this involves using terraform, and in those cases it's very clear that although terraform is a great tool, it has certain gaps that can't be addressed by its particular methodology.
Terraform is a relatively new player in the space of infrastructure orchestration and management. Overall it's a great tool for general use. When you use tf you have the advantage of locking your infrastructure developers into a very well defined declarative space. This is hugely beneficial when you have platform teams that either don't know much about how the cloud works, or enjoy the security of knowing they have "bumpers" which keep them safe.
This body of work is aimed at breaking out of that restrictive declarative mindset. Here we're going to explore how to use CloudFormation and ruby, writing the least amount of code needed to create business value quickly. The goal is to build something quick and slick that our customers (usually internal developers) can start using right away.
It's entirely possible to take the idea of what's happening here and translate it to tf, and I highly encourage you to look into that. TF is like buying beer off the shelf, whereas what we're doing here is more like the home-brewed craft beer experience. The outcome of this work will be a gem that we can package and ship to our customers so that they can easily and safely deploy and interact with the system that we've built. We want our customers to be able to integrate this work into their CI/CD workflows with as little effort on their part as possible.
I want to clearly state that this work isn't for everyone. If you're just looking to belt out some quick and dirty infrastructure bits to get people off your ass, then tf is your best bet. However, if you're looking to create something that is really outside of the box and absolutely a level above the average, this is probably going to be very fun for you. After all, some people long to craft the perfect beer, and others just want to get their buzz on.
Let's get crafting.
We'll start this work by setting up a workspace to develop our platform. Most of this work has been done in a project that I'm codenaming Shatterdome, after my favorite movie, Pacific Rim.
Eventually we're going to have a bin file that we can execute on the command line. We'll install this bin via the gem package. You can do the same kind of thing with python or just about any other language.
The bin is going to follow a command pattern in the form of noun verb. For example:
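shatterdome cluster create

(assuming the bin ends up being called shatterdome, after the project codename)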
In this case the noun is cluster and the verb is create. We'll have two nouns to start with, but we might expand beyond that later.