Shell script and AWK are very complementary languages. AWK was designed from its very beginnings at Bell Labs as a pattern-action language for short programs, ideally one or two lines long. It was intended to be used on the Unix shell interactive command line, or in shell scripts. Its feature set filled out some functionality that shell script at the time lacked, and often still lacks, as is the case with floating point numbers; it thereby (indirectly) brings much of the C language’s expressive power to the shell.
It’s therefore both common and reasonable to see AWK one-liners in shell
scripts for data processing where doing the same in shell is unwieldy or
impossible, especially when floating point operations or data delimiting are
involved. While AWK’s full power is in general tragically underused, most
shell script users and developers know about one of its most useful properties:
selecting a single column from whitespace-delimited data. Sometimes, cut(1)
doesn’t, uh, cut it.
In order for one language to cooperate with another usefully via embedded programs in this way, data of some sort needs to be passed between them at runtime, and here there are a few traps with syntax that may catch out unwary shell programmers. We’ll go through a simple example showing the problems, and demonstrate a few potential solutions.
Easy: fixed data
Embedded AWK programs in shell scripts work great when you already know
before runtime what you want your patterns for the pattern-action pairs to
be. Suppose our company has a vendor-supplied program that returns temperature
sensor data for the server room, and we want to run some commands for any and
all rows registering over a certain threshold temperature. The output of the
existing server-room-temps command might look like this:
$ server-room-temps
ID Location Temperature_C
1 hot_aisle_1 27.9
2 hot_aisle_2 30.3
3 cold_aisle_1 26.0
4 cold_aisle_2 25.2
5 outer 23.9
The task for the monitoring script is simple: get a list of all the locations where the temperature is above 28°C. If there are any such locations, we need to email the administrator the full list. Easy! It looks like every introductory AWK example you’ve ever seen—it could be straight out of the book. Let’s type it up on the shell to test it:
$ server-room-temps | awk 'NR > 1 && $3 > 28 {print $2}'
hot_aisle_2
That looks good. The script might end up looking something like this:
#!/bin/sh
alerts=/var/cache/temps/alerts
server-room-temps |
    awk 'NR > 1 && $3 > 28 {print $2}' > "$alerts" || exit
if [ -s "$alerts" ] ; then
    mail -s 'Temperature alert' sysadmin < "$alerts"
fi
So, after writing the alerts data file, we test it with [ -s ... ] to see
whether it’s got any data in it. If it does, we send it all to the
administrator with mail(1). Done!
We set that running every few minutes with cron(8) or systemd.timer(5), and
we have a nice stop-gap solution until the lazy systems administrator gets
around to fixing the Nagios server. He’s probably just off playing
ADOM again…
Hard: runtime data
A few weeks later, our sysadmin still hasn’t got the Nagios server running, because his high elf wizard is about to hit level 50, and there’s a new request from the boss: can we adjust the script so that it accepts the cutoff temperature as an argument, so that other departments can use it? Sure, why not. Let’s mock that up, with a threshold of, let’s say, 25.5°C.
$ server-room-temps > test-data
$ threshold=25.5
$ awk 'NR > 1 && $3 > $threshold {print $2}' test-data
hot_aisle_1
hot_aisle_2
Wait, that’s not right. There are three lines with temperatures over 25.5°C,
not two. Where’s cold_aisle_1?
Looking at the code more carefully, you realize that you assumed your shell variable would be accessible from within the AWK program, when of course, it isn’t; AWK’s variables are independent of shell variables. You don’t know why the hell it’s showing those two rows, though…
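For the record, here is a sketch of what is most likely going on, assuming a POSIX-conforming awk. Inside AWK, threshold is an uninitialized variable, so in a field reference it counts as zero, and $threshold is really $0: the whole line. Comparing $3 against a whole line is a string comparison, and "27.9" and "30.3" just happen to sort after lines that begin with "1" and "2". The data variable below is only there to recreate the sample output from earlier.

```shell
# Recreate the sample sensor output shown earlier (header plus five rows).
data='ID Location Temperature_C
1 hot_aisle_1 27.9
2 hot_aisle_2 30.3
3 cold_aisle_1 26.0
4 cold_aisle_2 25.2
5 outer 23.9'

# threshold is unset inside AWK, so $threshold is $0, the whole record.
same=$(printf '%s\n' "$data" | awk 'NR == 2 { print ($threshold == $0) }')
echo "$same"

# So the test is really a string comparison of $3 against the entire line.
rows=$(printf '%s\n' "$data" | awk 'NR > 1 && $3 > $0 { print $2 }')
echo "$rows"
```

The first command prints 1 (the two expressions are the same field), and the second reproduces the two mystery rows.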
Maybe we need double quotes?
$ awk "NR > 1 && $3 > $threshold {print $2}" test-data
awk: cmd. line:1: NR > 1 && > 25.5 {print}
awk: cmd. line:1: ^ syntax error
Hmm. Nope. Maybe we need to double-quote the variable inside the single quotes?
$ awk 'NR > 1 && $3 > "$threshold" {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1
cold_aisle_2
outer
That’s not right, either. It seems to have printed all the locations, as if it didn’t test the threshold at all.
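A quick way to see what happened here, sketched under the assumption of an ASCII-ordered collation: inside single quotes the shell leaves $threshold alone, so AWK receives a string literal made of a dollar sign and the word "threshold". Every temperature, compared as a string, sorts after a string starting with "$", so every row passes.

```shell
# The shell does not expand anything inside single quotes, so AWK sees
# the double-quoted text as an ordinary string literal.
literal=$(awk 'BEGIN { print "$threshold" }')
echo "$literal"
```

This prints the text $threshold itself, not 25.5.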
Maybe it should be outside the single quotes?
$ awk 'NR > 1 && $3 > '$threshold' {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1
The results look right, now … ah, but wait, we still need to quote it to prevent the shell from splitting it on spaces…
$ awk 'NR > 1 && $3 > '"$threshold"' {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1
Cool, that works. Let’s submit it to the security team and go to lunch.
Caught out
To your surprise, the script is rejected. The security officer says you have an unescaped variable that allows arbitrary code execution. What? Where? It’s just AWK, not SQL…!
To your horror, the security officer demonstrates:
$ threshold='0;{system("echo rm -fr /*");exit}'
$ echo 'NR > 1 && $3 > '"$threshold"' {print $2}'
NR > 1 && $3 > 0;{system("echo rm -fr /*");exit} {print $2}
$ awk 'NR > 1 && $3 > '"$threshold"' {print $2}' test-data
rm -fr /bin /boot /dev /etc /home /initrd.img ...
Oh, hell… if that were installed, and someone were able to set threshold to
an arbitrary value, they could execute any AWK code, and thereby any shell
commands, that they wanted to. It’s AWK injection! How embarrassing. Good
thing that was never going to run as root (…right?) Back to the drawing
board …
Validating the data
One approach that might come readily to mind is to ensure that no unexpected
characters appear in the value. We could use a case statement before
interpolating the variable into the AWK program to check it contains no
characters other than digits and a decimal point:
case $threshold in
*[!0-9.]*) exit 2 ;;
esac
That works just fine, and it’s appropriate to do some data validation at the opening of the script, anyway. It’s certainly better than leaving it as it was. But we learned this lesson with PHP in the 90s; you don’t just filter on characters, or slap in some backslashes—that’s missing the point. Ideally, we need to safely pass the data into the AWK process without ever parsing it as AWK code, sanitized or nay, so the situation doesn’t arise in the first place.
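If you do keep a validation step, note that the character check above still lets through an empty string, or something like 1.2.3. Here is a slightly stricter sketch; is_decimal is a hypothetical helper name, not something from the original script, and like the original check it rejects negative numbers.

```shell
# is_decimal is a hypothetical helper for illustration.
# It accepts strings such as 25, 25.5 or .5 and rejects everything else.
is_decimal() {
    case $1 in
        '' | . | *[!0-9.]* | *.*.*) return 1 ;;  # empty, lone dot, bad char, two dots
        *) return 0 ;;
    esac
}

threshold=25.5
is_decimal "$threshold" || exit 2
```

This only narrows what gets through; it doesn't change the point that the value shouldn't be parsed as AWK code at all.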
Environment variables
The shell and your embedded AWK program may not share the shell’s local
variables, but they do share environment variables, accessible in AWK’s
ENVIRON array. So, passing the threshold in as an environment variable
works:
$ THRESHOLD=25.5
$ export THRESHOLD
$ awk 'NR > 1 && $3 > ENVIRON["THRESHOLD"] {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1
Or, to be a little cleaner:
$ THRESHOLD=25.5 \
awk 'NR > 1 && $3 > ENVIRON["THRESHOLD"] {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1
This is already much better. AWK will parse our data only as a variable, and
won’t try to execute anything within it. The only snag with this method is
picking a name; make sure that you don’t overwrite another, more important
environment variable, like PATH, or LANG…
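To check that claim, here is a sketch that feeds the security officer's payload through ENVIRON. TEMPS_THRESHOLD is a made-up, prefixed variable name, chosen only to show one way of dodging name collisions, and the data variable recreates the sample output from earlier.

```shell
data='ID Location Temperature_C
1 hot_aisle_1 27.9
2 hot_aisle_2 30.3
3 cold_aisle_1 26.0
4 cold_aisle_2 25.2
5 outer 23.9'

# The same payload as before, this time passed as data, not as program text.
payload='0;{system("echo rm -fr /*");exit}'

# TEMPS_THRESHOLD is a hypothetical name; the prefix avoids clobbering
# anything important in the environment.
out=$(printf '%s\n' "$data" |
    TEMPS_THRESHOLD="$payload" \
        awk 'NR > 1 && $3 > ENVIRON["TEMPS_THRESHOLD"] {print $2}')
echo "$out"
```

Nothing in the payload gets executed. AWK does quietly fall back to a string comparison, though, so every location is printed; a validation step at the top of the script is still worth having to catch nonsense values.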
Another argument
Passing the data as another argument and then reading it out of the ARGV
array works, too:
$ awk 'BEGIN{ARGC--} NR > 1 && $3 > ARGV[2] {print $2}' test-data 25.5
This method is also safe from arbitrary code execution, but it’s still somewhat
awkward because it requires us to decrease the argument count ARGC by one so
that AWK doesn’t try to process a file named “25.5” and end up upset when it’s
not there. AWK arguments can mean whatever you need them to mean, but unless
told otherwise, AWK generally assumes they are filenames, and will attempt to
iterate through them for lines of data to chew on.
Here’s another way that’s very similar; we read the threshold from the second
argument, and then blank it out in the ARGV array:
$ awk 'BEGIN{threshold=ARGV[2];ARGV[2]=""}
NR > 1 && $3 > threshold {print $2}' test-data 25.5
AWK won’t treat the second argument as a filename, because it’s blank by the time it processes it.
Pre-assigned variables
There are two lesser-known syntaxes for passing data into AWK that allow you
to assign variables safely at runtime. The first is to use the -v option:
$ awk -v threshold="$threshold" \
'NR > 1 && $3 > threshold {print $2}' \
test-data
Another, perhaps even more obscure, is to set them as arguments before the
filename data, using the var=value syntax:
$ awk 'NR > 1 && $3 > threshold {print $2}' \
threshold="$threshold" test-data
Note that in both cases, we still quote the $threshold expansion; the shell
expands the value before AWK ever sees it, and the quotes keep it as a single
argument.
The difference between these two syntaxes is when the variable assignment
occurs. With -v, the assignment happens straight away, before the BEGIN block
of the program runs and before any data is read from the input sources.
With the argument form, it happens when the program’s data processing reaches
that argument. The upshot of that is that you could test several files with
several different temperatures in one hit, if you wanted to. Note the switch
from NR to FNR, so that the header line of every file is skipped, not just the
first one:
$ awk 'FNR > 1 && $3 > threshold {print $2}' \
threshold=25.5 test-data-1 threshold=26.0 test-data-2
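The timing difference is easy to see with a small experiment. This is only a sketch, using /dev/null as an empty input file so that nothing waits on standard input:

```shell
# With -v, the variable already has its value when BEGIN runs.
early=$(awk -v threshold=25.5 'BEGIN { print "[" threshold "]" }')
echo "$early"

# With the argument form, BEGIN runs before the assignment is processed,
# so the variable is still empty there; by END it has been set.
late=$(awk 'BEGIN { print "[" threshold "]" } END { print "[" threshold "]" }' \
    threshold=25.5 /dev/null)
echo "$late"
```

The first command prints [25.5]; the second prints [] and then [25.5]. One caveat that applies to both forms: backslash escape sequences in the value are interpreted, as if it were written in an AWK string literal, so for arbitrary strings that may contain backslashes, ENVIRON is the more literal route.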
Both of these assignment syntaxes are standardized in POSIX awk.
These are my preferred methods for passing runtime data; they require no argument count munging, avoid the possibility of trampling on existing environment variables, use AWK’s own variable and expression syntax, and, most importantly, give anyone reading the script a better chance of grasping what’s going on. You can thereby avoid the mess of quoting and back-ticking that often plagues these sorts of embedded programs.
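Putting that together, the filtering part of the monitoring script might end up as something like the following sketch. The function name hot_locations and the return status of 2 are assumptions for illustration, not part of the original script.

```shell
# hot_locations is a hypothetical helper: it reads sensor rows on standard
# input and prints the locations whose temperature exceeds the threshold
# given as its first argument.
hot_locations() {
    case $1 in
        '' | *[!0-9.]*) return 2 ;;  # basic sanity check, as discussed earlier
    esac
    # The value is assigned as data with -v; it is never parsed as AWK code.
    awk -v threshold="$1" 'NR > 1 && $3 > threshold {print $2}'
}

# In the monitoring script, usage might look like this:
#   server-room-temps | hot_locations "$1" > "$alerts" || exit
```

The validation stays in as a courtesy to the caller, but the safety no longer depends on it.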
Safety not guaranteed
If you take away only one thing from this post, it might be: don’t interpolate shell variables in AWK programs, because it has the same fundamental problems as interpolating data into query strings in PHP. Pass the data in safely instead, using environment variables, arguments, or AWK variable assignments. Keeping this principle in mind will serve you well for other embedded programs, too; stop thinking in terms of escaping and character whitelists, and start thinking in terms of passing the data safely in the first place.