Latest Posts

Using helm with Black Box

 helm automation blackbox

Overview

Chart

In most cases we can monitor things just by using simple tools that already exist, like k8s event logs, or by checking the number of running pods in a deployment and alerting when it drops below 1. That would indicate we have 0 running pods in a deployment or stateful set. In some cases we want to do more specific things, like looking at endpoint status or synthetic testing.
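
For a baseline check like that, something as simple as the following works. This is just a sketch; the deployment name and namespace are placeholders, not anything from this chart:

# Hypothetical example: alert if a deployment has no ready replicas.
# "myapp" and "my-namespace" are placeholders.
READY=$(kubectl get deployment myapp -n my-namespace -o jsonpath='{.status.readyReplicas}')
if [ "${READY:-0}" -lt 1 ]; then
  echo "ALERT: myapp has no ready pods"
fi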

In our case we’re using a shared_cluster, which is a k8s platform maintained by another team. This shared_cluster is heavily oriented around security, which means that by default all endpoints created by Istio Virtual Services ( in most cases AWS NLB -> k8s services ) are protected by both OAuth and a whitelisted firewall.

The use case for this work was to create a monitoring system which could alert on a few important things about each of our running services while keeping in mind the highly secure nature of the cluster:

  • DNS entries are configured correctly
    • This is usually out of scope for most monitoring systems
  • Service endpoints are protected by both whitelisted firewall rules and OAuth
  • Specific endpoints are whitelisted on the firewall but have no OAuth protection
  • Endpoints are returning specific responses for both HTTP and gRPC
  • HTTP endpoints properly redirect to HTTPS
  • TLS resolves properly on all services

Example

Workflow:

  • Create a deployment which runs Telegraf alongside the exporters
  • Telegraf scrapes the pg-exporter and bb-exporter ports for data
  • Telegraf exposes a /metrics endpoint which is scraped by our WaveFront collector and sent to WaveFront ( see the sketch below )
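
The /metrics side of that is handled by Telegraf's prometheus_client output. This is a minimal sketch of what that looks like, not the exact config from this chart; the port shown is the plugin default:

# Hedged sketch: re-expose everything Telegraf collects on a /metrics endpoint
# so the WaveFront collector can scrape it. ":9273" is the plugin's default listen address.
[[outputs.prometheus_client]]
  listen = ":9273"
  path = "/metrics"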

Module: OAuth blocked

In this case what we want to do is ensure that an endpoint is blocking requests that have no OAuth information. This can help us ensure that our endpoints are properly set up for OAuth. The use case here is that someone at some point might accidentally remove or alter the OAuth config and cause the authentication to stop working, leaving our endpoints exposed without any protection.

This could happen for a variety of reasons, but it’s always the reasons we don’t see coming that do the most harm.

This is our safety valve to make sure that we’re constantly testing and validating our authentication status.

# Telegraf input: scrape the blackbox exporter's probe endpoint for this module/target.
[[inputs.prometheus]]
  urls = [ "http://localhost:9115/probe?module=https_oauth_blocked&target=https://css-myapp-dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.w3n.io" ]

      # Blackbox exporter module ( blackbox.yml ) backing that probe.
      https_oauth_blocked:
        prober: http
        timeout: 5s
        http:
          valid_http_versions: [ "HTTP/1.1", "HTTP/2.0" ]
          valid_status_codes: [ 302 ]
          method: GET
          no_follow_redirects: true
          fail_if_ssl: false
          fail_if_not_ssl: true
          fail_if_header_not_matches:
            - header: location
              regexp: 'login.windows.net'
          tls_config:
            insecure_skip_verify: false
          preferred_ip_protocol: "ip4"
          ip_protocol_fallback: false

All pages protected with OAuth validation will return a 302 to login.windows.net.
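
A quick manual spot check of the same behavior looks something like this. The hostname here is a placeholder, not one of our real endpoints:

# Issue an unauthenticated GET and dump the response headers.
# We expect a 302 with a Location header pointing at login.windows.net.
curl -s -o /dev/null -D - "https://myapp.example.com/"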

DNS checks

In this case we are checking more than just endpoints. Here we’re validating that our DNS records are set up and continue to be configured correctly. This is like a constantly running validation of our externally facing system.

This is another safety check to make sure that our systems stay configured the way we expect them to be configured.

It’s not impossible for something to get knocked out of alignment somewhere. It has happened, and this is what “reliability” means in SRE.

      css.dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.mydomain.com:
        prober: dns
        timeout: 5s
        dns:
          query_name: "css.dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.mydomain.com"
          query_type: "A"
          transport_protocol: 'udp'
          preferred_ip_protocol: "ip4"

This will check for a DNS entry that is configured as an A record.
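
You can do the same check by hand with dig; this is just the manual equivalent of the probe above:

# Resolve the A record that the blackbox module validates.
# An empty answer means the record is missing or misconfigured.
dig +short A css.dev.cluster-0-non-prod-us-east-1.aws.infra.shared_cluster.mydomain.com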

In some cases we’ve had people outright remove the Istio Virtual Service with a bad change to the helm chart, which knocked this out of existence for a bit. Not a huge deal, but we were able to detect the change quickly and push out a fix once we got the alarm.

Conclusion

Early alarming like this is critical when we want to enable developers to move fast and build things.

I believe the role of DevOps and/or SRE is to build things like this to catch the common failures, alert, and remediate quickly. This stands apart from other modes that might use something like a change management process to gate and review every possible change that goes to production.

I will always lean toward less paperwork and more automation, because moving at the speed of business requires this kind of agile thinking.

Read More →

Terraform workspaces with gitlab CICD

 terraform gitlab cicd

Overview

Using terraform workspaces with a little python code and gitlab cicd pipelines to create a dynamic, interesting pipeline.

Use case

Company: Renovo

We wanted to create a single terraform manifest that could express complicated infrastructure requirements with environmental deltas: a single, simple implementation, with each environment expressing its own uniqueness based on its needs.

➜  raleigh git:(master) terraform workspace list
* default

➜  raleigh git:(master) terraform workspace new stage
Created and switched to workspace "stage"!

You're now on a new, empty workspace. Workspaces isolate their state,
so if you run "terraform plan" Terraform will not see any existing state
for this configuration.
➜  raleigh git:(master) terraform workspace new production
Created and switched to workspace "production"!

You're now on a new, empty workspace. Workspaces isolate their state,
so if you run "terraform plan" Terraform will not see any existing state
for this configuration.
➜  raleigh git:(master) terraform workspace list          
  default
* production
  stage
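
On the gitlab side, a job can pick ( or create ) the right workspace before planning. This is a minimal sketch, not the exact pipeline from this repo, and it assumes the environment name comes from GitLab's standard CI_ENVIRONMENT_NAME variable:

#!/bin/bash
# Select the workspace for this environment, creating it on the first run.
terraform init
terraform workspace select "${CI_ENVIRONMENT_NAME}" || terraform workspace new "${CI_ENVIRONMENT_NAME}"
terraform plan -out=plan.tfplan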

Read More →

JQ tricks with AWS

 aws jq

Overview

Handy tricks for pulling out AWS data. From time to time I find myself needing some quick and dirty reports detailing aspects of a given AWS account. These helper scripts have helped me in my consulting work, as no two clients are ever using the same set of tools, and as a consultant I often find myself restrained by security constraints.

In general most of my code ( be it ruby, or jq ) uses some kind of simple cache mechanism. In most of these examples I'm making the aws cli call and passing the data to a file. I then use the file with jq. This helps me tune and debug the data as needed. I can also, obviously, reuse the data for different queries. Some day I might get around to making this a little more sophisticated, but for now I usually run the script once, then comment the line out and run it again as many times as is needed.
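
A slightly less manual version of that pattern is to skip the API call whenever the cache file already exists. This is just a sketch of the idea, using the EC2 instance example below:

#!/bin/bash
# Re-use the cached JSON if it's already on disk; otherwise fetch it once from the AWS CLI.
CACHE=/tmp/instance_types.json
[ -f "$CACHE" ] || aws ec2 describe-instances --query "Reservations[].Instances[].{InstanceType:InstanceType}" > "$CACHE"
jq -r 'group_by(.InstanceType) | map({type: .[0].InstanceType, count: length}) | sort_by(.count)[] | [.type, .count|tostring] | join("\t")' "$CACHE"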

EC2 Instances

This gives me a mapping of the instance types used. This is most often useful in environments where we're using mostly static instance builds without any clustering ( EKS, ECS ) or ASGs. This is a sure sign that the client does not have a clean way to deploy code to newly created instances, which is a huge red flag.

aws ec2 describe-instances --query "Reservations[].Instances[].{InstanceType:InstanceType}" > /tmp/instance_types.json
cat /tmp/instance_types.json | jq -r 'group_by(.InstanceType) | map({type: .[0].InstanceType, count: length}) | sort_by(.count)[] | [.type, .count|tostring] | join("\t")'

EBS Blocks

This is a great way to get a sense for the size and usage of EBS blocks.

#!/bin/bash

aws ec2 describe-volumes > /tmp/volumes.json

## Report total size of all ebs volumes
echo "---Total used---"
cat /tmp/volumes.json|jq '.Volumes|reduce .[].Size as $item (0; . + $item)'

## Report breakdown of types.
echo "---Types---"
cat /tmp/volumes.json|jq -r '.Volumes| group_by(.VolumeType)| map({volume: .[0].VolumeType, count: length}) | sort_by(.count)[] | [.volume, .count|tostring] | join("\t")'

## Report on size of volumes
echo "---Size (size in GiB / number )---"
cat /tmp/volumes.json|jq -r '.Volumes| group_by(.Size)| map({size: .[0].Size, count: length}) | sort_by(.size)[] | [.size, .count|tostring] | join("\t")'

echo "---Usage---"
cat /tmp/volumes.json|jq -r '.Volumes| group_by(.State)| map({state: .[0].State, count: length}) | sort_by(.count)[] | [.state, .count|tostring] | join("\t")'

Snapshots

I've seen cases where Lambda functions are forgotten about and snapshots end up piling up over time. It's usually not a big deal since snaps are cheap, but it's sometimes handy to know.

#!/bin/bash

aws ec2 describe-snapshots --owner-id=349250784145 > /tmp/snapshots.json

## Report total size of all snapshots
echo "---Total used---"
cat /tmp/snapshots.json|jq '.Snapshots|reduce .[].VolumeSize as $item (0; . + $item)'

## Report breakdown of snapshot sizes
echo "---Size (size in GiB / count)---"
cat /tmp/snapshots.json|jq -r '.Snapshots| group_by(.VolumeSize)| map({size: .[0].VolumeSize, count: length}) | sort_by(.size)[] | [.size, .count|tostring] | join("\t")'

Read More →

Relinker hackery

 cloudformation

Overview

The idea of a relinker is pretty simple. It's basically an internal version of Google's URL shortener. The internal part is important because the whole point is keeping things internal.

What we want to create at the end of this is the ability to use our local browser and hit http://go/ to get to the webui for relinker. Then, once we have a link loaded, we can simply type something like "go/s" in our browser and know that it'll go to wherever "s" is pointed at.

This works really well for local development, and in many ways it's basically just a weird way of creating a bizarre kind of bookmarking system that runs in the cloud. However, it gets really useful when we look at a few interesting aspects of why we would do this:

  1. It proves out the container story, which can be useful for companies that are new to the containerization journey
  2. It allows for something called telepresence, which lets us run the local container against the remote database connection.
  3. If we get the IT folks on board, they can wire in an office DNS entry so that we don't need a local docker container to redirect to the cloud service.

The trippy part about this is how we do the redirection. This document assumes that we have a macbook running VirtualBox where VBox is running a debian VM ( or similar ). You could probably do the same thing with docker-engine or similar. Basically we have to hack our local dns on the macbook as such:

127.0.0.1 go

Then we have a docker container on the debian VM that redirects :80 to our cloudy instance. We'll get into how this is wired up later. We use the local redir hack if we don't have IT buy-in quite yet. This is often the case when we want to prove things out before we actually start playing with things.
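
As a rough sketch of that redir container ( the image and hostname here are stand-ins, not the actual setup from this post ), a one-line TCP forward does the job:

# Forward port 80 on the debian VM to the relinker service running in the cloud.
# relinker.example.com is a placeholder for the real cloud endpoint.
docker run -d --name relinker-redir -p 80:80 alpine/socat \
  TCP-LISTEN:80,fork,reuseaddr TCP:relinker.example.com:80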

Layout

  • Mongodb for storing stuff. We assume, at least for now, that this is going to be transient, which is fine.
  • Relinker docker container that connects to the mongodb container.
  • ECS because containers are awesome.
  • Local redir container running in debian.

Plan

  1. First, we'll set up our local environment using docker-compose and the like ( see the sketch after this list ).
  2. Deploy ECS cluster
  3. Deploy mongodb and relinker service.
  4. Wire everything up end to end
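
Here's a rough idea of what that local docker-compose setup could look like. The image names and ports are placeholders, not the project's actual compose file:

# Hypothetical docker-compose.yml: mongodb plus the relinker web container.
version: "3"
services:
  mongo:
    image: mongo:4
  relinker:
    build: ./web
    ports:
      - "8080:80"
    depends_on:
      - mongo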

Build locally

 cd web
 ./api.rb

Now we can play around with the UI.

    amazon-ecs:     Chef::Exceptions::Service
    amazon-ecs:     -------------------------
    amazon-ecs:     service[procps]: Service is not known to chkconfig.
    amazon-ecs: Profile Summary: 48 successful controls, 4 control failures, 2 controls skipped
    amazon-ecs: Test Summary: 117 successful, 6 failures, 2 skipped
    amazon-ecs:   ×  package-08: Install auditd (1 failed)
    amazon-ecs:      ✔  System Package audit should be installed
    amazon-ecs:      ✔  Audit Daemon Config log_file should cmp == "/var/log/audit/audit.log"
    amazon-ecs:      ✔  Audit Daemon Config log_format should cmp == "raw"
    amazon-ecs:      ✔  Audit Daemon Config flush should match /^INCREMENTAL|INCREMENTAL_ASYNC$/
    amazon-ecs:      ×  Audit Daemon Config max_log_file_action should cmp == "keep_logs"
    amazon-ecs: Profile Summary: 52 successful controls, 0 control failures, 2 controls skipped
    amazon-ecs: Test Summary: 123 successful, 0 failures, 2 skipped

Path

  • /shatterdome
    • /clusters
      • /cluster_type_name

Read More →

Shatterdome

 cloudformation

Overview

Shatterdome is my attempt at teaching how to build a platform tool.

In the course of my career I've run into more than a few companies that are working on building their own internal platform tooling. In some cases this involves using terraform, and in those cases it's very clear that although terraform is a great tool, it has certain gaps that can't be addressed by its particular methodology.

Terraform is a relatively new player in the space of infrastructure orchestration and management. Overall it's a great tool for general use. When you use tf you have the advantage of locking your infrastructure developers into a very well defined declarative space. This is hugely beneficial when you have platform teams that either don't know much about how the cloud works, or enjoy the security of knowing they have "bumpers" which keep them safe.

This body of work is aimed at breaking out of the mindset of the restrictive declarative space. Here we're going to explore how to use CloudFormation and ruby to create the least amount of code to create business value quickly. The goal here is to create something quick and slick that our customers ( usually internal developers ) can start using quickly.

It's entirely possible to take the idea of what's happening here and translate it to tf, and I highly encourage you to look into that. TF is like buying beer off the shelf, whereas what we're doing here is more like the home-brewed craft beer experience. The outcome of this work will be a gem that we can package and ship to our customers so that they can easily and safely deploy and interact with the system that we've built. We want our customers to be able to integrate this work into their CI/CD workflows with as little effort on their part as possible.

I want to clearly state that this work isn't for everyone. If you're just looking to belt out some quick and dirty infrastructure bits to get people off your ass, then tf is your best bet. However, if you're looking to create something that is really outside of the box and absolutely a level above the average, this is probably going to be very fun for you. After all, some people long to craft the perfect beer, and others just want to get their buzz on.

Let's get crafting.

Workspace

We'll start this work by setting up a workspace to develop our platform. Most of this work has been done in a project that I'm codenaming Shatterdome after one of my favorite movies, Pacific Rim.

Eventually we're going to have a bin file that we can execute on the command line. We'll install this bin via the gem package. You can do the same kind of thing with python or just about any other language.

The bin is going to follow a command pattern in the form of noun verb. For example:

krogebry@ubuntu-secure:~$ shatterdome cluster create

In this case the noun is cluster and the verb is create. We'll have two nouns to start with, but we might expand beyond that later.

  1. cluster: this is essentially the ECS cluster, with associated ASG, and CW triggers.
  2. service: a service is a collection of things that we're going to run on a cluster.

Cluster

  • AutoScale group
  • IAM policies
  • ECS cluster

Service

  • ECS service
  • ECS tasks
  • ALB or NLB

Read More →