Fun with Bash
I was given a fairly mundane task at work, and when this happens I try to find ways to make it interesting. I will usually find a way to learn something new while doing the task, so I always get something out of the work I do, even if it's dull.
I would quite like to get better at using the legendary tools available to me on the command line. I have never felt super confident doing text processing there, often giving up and just going to my editor instead.
I am not going to exhaustively document every command I use; if you're curious about the details, consult man $command, just like I did! This post will simply illustrate how, with a little research, you can quickly do some rad text processing.
The task
I am trying to analyse how much memory we are provisioning compared to actual usage for some of our services to see if our AWS bill can be cut down a bit.
The data
I thought it would be fun (??) to use CSV, as it feels like naturally CSV-shaped data. Here is a sample of the data I captured.
name, usage (mb), allocated (mb), CPU %, containers
dispatcher, 150, 512, 40, 10
assembler, 175, 512, 75, 10
matcher, 85, 512, 15, 10
user-profile, 128, 512, 40, 5
profile-search, 220, 512, 80, 10
reporter, 90, 512, 40, 10
mailgun-listener, 90, 512, 10, 5
unsubscribe, 64, 512, 3, 5
bounce, 8, 128, 0.5, 3
legacy-reporting, 30, 512, 15, 3
content-store, 80, 256, 30, 10
legacy-alert-poller, 64, 256, 1, 1
migrator, 80, 256, 10, 5
entitlements-update, 150, 256, 70, 3
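If you want to follow along, save the sample into a file called data.csv (run the command below, paste the data, then press ctrl-d):

cat > data.csv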
Display it nice
This is nice and easy:

column -s, -t < data.csv

-t determines the number of columns the input contains and prints the data as a table, and -s specifies the set of characters to delimit columns by. If you don't specify -s, it defaults to splitting on whitespace.
name usage (mb) allocated (mb) CPU % containers
dispatcher 150 512 40 10
assembler 175 512 75 10
matcher 85 512 15 10
user-profile 128 512 40 5
profile-search 220 512 80 10
reporter 90 512 40 10
mailgun-listener 90 512 10 5
unsubscribe 64 512 3 5
bounce 8 128 0.5 3
legacy-reporting 30 512 15 3
content-store 80 256 30 10
legacy-alert-poller 64 256 1 1
migrator 80 256 10 5
entitlements-update 150 256 70 3
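As a quick aside, you can see that default whitespace splitting by leaving out -s entirely; here is a tiny made-up example:

printf 'one 1\ntwo 22\nthree 333\n' | column -t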
Sorting by usage
cat data.csv | sort -n --field-separator=',' --key=2 | column -s, -t
name usage (mb) allocated (mb) CPU % containers
bounce 8 128 0.5 3
legacy-reporting 30 512 15 3
legacy-alert-poller 64 256 1 1
unsubscribe 64 512 3 5
content-store 80 256 30 10
migrator 80 256 10 5
matcher 85 512 15 10
mailgun-listener 90 512 10 5
reporter 90 512 40 10
user-profile 128 512 40 5
dispatcher 150 512 40 10
entitlements-update 150 256 70 3
assembler 175 512 75 10
profile-search 220 512 80 10
--key=2 means sort by the second column, --field-separator=',' splits each line into columns on commas, and -n compares the values numerically rather than alphabetically.
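If you prefer the terser spellings, the same sort can be written with the short flags, where -t is the field separator and -k is the key:

sort -n -t, -k2 data.csv | column -s, -t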
Using awk to figure out the memory differences
What we're really interested in is the difference between the amount of memory provisioned and the amount actually used. awk is handy for this kind of per-field arithmetic: -F , tells it to split each line on commas, $1 is the service name, and $3-$2 is allocated minus used.

awk -F , '{print $1, $3-$2}' data.csv
Let's pipe that into column again
awk -F , '{print $1, $3-$2}' data.csv | column -t
name 0
dispatcher 362
assembler 337
matcher 427
user-profile 384
profile-search 292
reporter 422
mailgun-listener 422
unsubscribe 448
bounce 120
legacy-reporting 482
content-store 176
legacy-alert-poller 192
migrator 176
entitlements-update 106
This is nice, but it would be good to ignore the header line.
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2
tail -n X prints the last X lines of its input. With a plus sign, tail -n +X instead prints from line X to the end, so tail -n +2 drops the first line, which here is our header.
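awk can skip the header itself if you prefer; NR is the current record (line) number, so guarding the print with NR>1 is an alternative spelling of the same thing:

awk -F , 'NR>1 {print $1, $3-$2}' data.csv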
Sort mk 2
Now we have some memory differences, it would be handy to sort them so we can address the most inefficient configurations first.
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2
-r reverses the sort so the biggest differences come first. And of course, use column again to make it look pretty:
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2 | column -t
legacy-reporting 482
unsubscribe 448
matcher 427
reporter 422
mailgun-listener 422
user-profile 384
dispatcher 362
assembler 337
profile-search 292
legacy-alert-poller 192
migrator 176
content-store 176
bounce 120
entitlements-update 106
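While we're here, a rough sketch of the total over-provisioning, summing the differences in an END block (this treats each row as a single number and ignores the containers column, so take it with a pinch of salt):

awk -F , 'NR>1 {sum += $3-$2} END {print sum " mb total"}' data.csv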
WTF
There it is! The utterly indecipherable bash command that someone reads 6 months later and scratches their head at. In fact, it has been 2 weeks since I wrote the first draft of this, and I look at the final command and weep.
It is very easy to throw up your hands when you see a shell script that doesn't make sense, but there are things you can do.
Remember that the process will usually start small, like it did here: one command, piped into another, into another. This gives a lazy dev like me the impression that it is one complicated command, but all it really is is a set of steps to process some data. So if you're struggling, you can wind back some of the steps for yourself by deleting them one at a time and seeing what happens.
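For example, winding our final command back looks like this; run each line on its own and compare the outputs:

awk -F , '{print $1, $3-$2}' data.csv
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2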
If it is an important business process that needs to be understood for a long time, you're probably better off writing it in a language where you can write some automated tests around it.
But a lot of the work we do is ad hoc and doesn't reside in "the codebase", where you can easily get into a TDD rhythm to accomplish something shiny. Often you have to do something a little boring, and sometimes the tools already available on your computer can really help you out. They're so old and well established that you can find tons of documentation and tips, so dive in!
If you need more reasons to get to know the shell better, read how command-line tools can be up to 235x faster than your Hadoop cluster.