Fun with Bash
I was given a fairly mundane task at work, and when this happens I try to find ways to make it interesting. I will usually find a way to learn something new while doing the task, so I always get something out of the work I do, even if it's dull.
I would quite like to get better at using the legendary tools available to me on the command line. I have never felt super confident doing text processing there, often giving up and just going to my editor instead.
I am not going to exhaustively document every command I use; if you're curious about the details, consult man $command, just like I did! This post will simply illustrate how, with a little research, you can quickly do some rad text processing.
The task
I am trying to analyse how much memory we are provisioning compared to actual usage for some of our services to see if our AWS bill can be cut down a bit.
The data
I thought it would be fun (??) to use CSV, as it feels like naturally CSV-shaped data. Here is a sample of the data I captured.
name, usage (mb), allocated (mb), CPU %, containers
dispatcher, 150, 512, 40, 10
assembler, 175, 512, 75, 10
matcher, 85, 512, 15, 10
user-profile, 128, 512, 40, 5
profile-search, 220, 512, 80, 10
reporter, 90, 512, 40, 10
mailgun-listener, 90, 512, 10, 5
unsubscribe, 64, 512, 3, 5
bounce, 8, 128, 0.5, 3
legacy-reporting, 30, 512, 15, 3
content-store, 80, 256, 30, 10
legacy-alert-poller, 64, 256, 1, 1
migrator, 80, 256, 10, 5
entitlements-update, 150, 256, 70, 3
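If you want to follow along, save the sample into a file called data.csv (run the command below, paste the data, then press ctrl-d):

cat > data.csv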
Display it nice
This is nice and easy:

column -s, -t < data.csv

-t determines the number of columns the input contains and prints the data as a table, and -s specifies the set of characters to delimit columns by. If you don't specify -s, it defaults to splitting on whitespace.
name usage (mb) allocated (mb) CPU % containers
dispatcher 150 512 40 10
assembler 175 512 75 10
matcher 85 512 15 10
user-profile 128 512 40 5
profile-search 220 512 80 10
reporter 90 512 40 10
mailgun-listener 90 512 10 5
unsubscribe 64 512 3 5
bounce 8 128 0.5 3
legacy-reporting 30 512 15 3
content-store 80 256 30 10
legacy-alert-poller 64 256 1 1
migrator 80 256 10 5
entitlements-update 150 256 70 3
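As a quick aside, you can see that default whitespace splitting by leaving out -s entirely; here is a tiny made-up example:

printf 'one 1\ntwo 22\nthree 333\n' | column -t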
Sorting by usage
cat data.csv | sort -n --field-separator=',' --key=2 | column -s, -t
name usage (mb) allocated (mb) CPU % containers
bounce 8 128 0.5 3
legacy-reporting 30 512 15 3
legacy-alert-poller 64 256 1 1
unsubscribe 64 512 3 5
content-store 80 256 30 10
migrator 80 256 10 5
matcher 85 512 15 10
mailgun-listener 90 512 10 5
reporter 90 512 40 10
user-profile 128 512 40 5
dispatcher 150 512 40 10
entitlements-update 150 256 70 3
assembler 175 512 75 10
profile-search 220 512 80 10
--key=2 means sort by the second column, --field-separator=',' splits each line into columns on commas, and -n compares the values numerically rather than alphabetically.
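If you prefer the terser spellings, the same sort can be written with the short flags, where -t is the field separator and -k is the key:

sort -n -t, -k2 data.csv | column -s, -t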
Using awk to figure out the memory differences
What we're really interested in is the difference between the amount of memory provisioned and the amount actually used. awk is handy for this kind of per-field arithmetic: -F , tells it to split each line on commas, $1 is the service name, and $3-$2 is allocated minus used.

awk -F , '{print $1, $3-$2}' data.csv
Let's pipe that into column again
awk -F , '{print $1, $3-$2}' data.csv | column -t
name 0
dispatcher 362
assembler 337
matcher 427
user-profile 384
profile-search 292
reporter 422
mailgun-listener 422
unsubscribe 448
bounce 120
legacy-reporting 482
content-store 176
legacy-alert-poller 192
migrator 176
entitlements-update 106
This is nice, but it would be good to ignore the header line.
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2
tail -n X prints the last X lines of its input. With a plus sign, tail -n +X instead prints from line X to the end, so tail -n +2 drops the first line, which here is our header.
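awk can skip the header itself if you prefer; NR is the current record (line) number, so guarding the print with NR>1 is an alternative spelling of the same thing:

awk -F , 'NR>1 {print $1, $3-$2}' data.csv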
Sort mk 2
Now we have some memory differences, it would be handy to sort them so we can address the most inefficient configurations first.
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2
-r reverses the sort so the biggest differences come first. And of course, use column again to make it look pretty:
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2 | column -t
legacy-reporting 482
unsubscribe 448
matcher 427
reporter 422
mailgun-listener 422
user-profile 384
dispatcher 362
assembler 337
profile-search 292
legacy-alert-poller 192
migrator 176
content-store 176
bounce 120
entitlements-update 106
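While we're here, a rough sketch of the total over-provisioning, summing the differences in an END block (this treats each row as a single number and ignores the containers column, so take it with a pinch of salt):

awk -F , 'NR>1 {sum += $3-$2} END {print sum " mb total"}' data.csv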
WTF
There it is! The utterly indecipherable bash command that someone reads 6 months later and scratches their head at. In fact, it has been 2 weeks since I wrote the first draft of this, and I look at the final command and weep.
It is very easy to throw up your hands when you see a shell script that doesn't make sense, but there are things you can do.
Remember that the process will usually start small, like it did here: one command, piped into another, into another. This gives a lazy dev like me the impression that it is one complicated command, but all it really is is a set of steps to process some data. So if you're struggling, you can wind back some of the steps for yourself by deleting them one at a time and seeing what happens.
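For example, winding our final command back looks like this; run each line on its own and compare the outputs:

awk -F , '{print $1, $3-$2}' data.csv
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2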
If it is an important business process that needs to be understood for a long time, you're probably better off writing it in a language where you can write some automated tests around it.
But a lot of the work we do is ad hoc and doesn't reside in "the codebase", where you can easily get into a TDD rhythm to accomplish something shiny. Often you have to do something a little boring, and sometimes the tools already available on your computer can really help you out. They're so old and well established that you can find tons of documentation and tips, so dive in!
If you need more reasons to get to know the shell better, read how command-line tools can be up to 235x faster than your Hadoop cluster.