Default grep options

When you’re searching a set of version-controlled files for a string with grep, particularly with a recursive search, it can get very annoying to wade through swathes of results from the internals of hidden version control directories like .svn or .git, or from metadata files you’re unlikely to have wanted to search, such as .gitmodules.

GNU grep uses an environment variable named GREP_OPTIONS to define a set of options that are always applied to every call to grep. This comes in handy when exported in your .bashrc file to set a “standard” grep environment for your interactive shell. Here’s an example of a definition of GREP_OPTIONS that excludes a lot of patterns which you’d very rarely if ever want to search with grep:

GREP_OPTIONS=
for pattern in .cvs .git .hg .svn; do
    GREP_OPTIONS="$GREP_OPTIONS --exclude-dir=$pattern
done
export GREP_OPTIONS

Note that --exclude-dir is a relatively recent addition to the options for GNU grep, but by now it should only be missing on very old GNU/Linux machines. If you want to keep your .bashrc file compatible, you can apply a little extra hackery to check that the option is available before setting it up to be used:

GREP_OPTIONS=
if grep --help | grep -- --exclude-dir &>/dev/null; then
    for pattern in .cvs .git .hg .svn; do
        GREP_OPTIONS="$GREP_OPTIONS --exclude-dir=$pattern"
    done
fi
export GREP_OPTIONS

Similarly, you can ignore single files with --exclude. There’s also --exclude-from=FILE if your list of excluded patterns starts getting too long.
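
For example, to skip minified JavaScript and anything matching patterns listed in a file of your choosing, you might append something like the below; the ~/.grepignore filename here is only an illustration, not any kind of standard location:

GREP_OPTIONS="$GREP_OPTIONS --exclude=*.min.js"
GREP_OPTIONS="$GREP_OPTIONS --exclude-from=$HOME/.grepignore"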

Other useful options available in GNU grep that you might wish to add to this environment variable include:

  • --color — On appropriate terminal types, highlight the pattern matches in output, among other color changes that make results more readable
  • -s — Suppresses error messages about files not existing or being unreadable; helps if you find this behaviour more annoying than useful.
  • -E, -F, or -P — Pick a favourite “mode” for grep; devotees of PCRE may find that adding -P for grep’s PCRE support makes grep behave in a much more pleasing way, even though the manual describes that support as experimental and incomplete
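
Applying these works the same way as before; as a sketch, you might append them to the variable before exporting it, with --color=auto keeping the highlighting out of pipes:

GREP_OPTIONS="$GREP_OPTIONS --color=auto -s"
export GREP_OPTIONS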

If you don’t want to use GREP_OPTIONS, you could instead simply set up an alias:

alias grep='grep --exclude-dir=.git'

You may actually prefer this method, as it’s functionally equivalent; if you do it this way, then whenever you want to call grep without your standard set of options, you need only prepend a backslash to the call:

$ \grep pattern file
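
Alternatively, the command builtin skips alias lookup in the same way, which can read a little more explicitly in scripts:

$ command grep pattern file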

Commenter Andy Pearce also points out that using this method can avoid some build problems where GREP_OPTIONS would interfere.

Of course, you could solve a lot of these problems simply by using ack … but that’s another post.

Counting with grep and uniq

A common idiom in Unix is to count the lines of output in a file or pipe with wc -l:

$ wc -l example.txt
43 example.txt
$ ps -e | wc -l
97

Sometimes you want to count the number of lines of output from a grep call, however. You might do it this way:

$ ps -ef | grep apache | wc -l
6

But grep has built-in counting of its own, with the -c option:

$ ps -ef | grep -c apache
6
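
Note that -c counts matching lines rather than individual matches, so a line containing the pattern twice still only counts once. Given multiple files, grep -c also prefixes each count with its filename, which wc -l can’t do for you; the log filenames below are only illustrative:

$ grep -c error *.log
app.log:3
cron.log:0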

The above is more a matter of good style than efficiency, but another tool with a built-in counting option that could save you time is the oft-used uniq. The below example shows a use of uniq to filter a sorted list into unique rows:

$ ps -eo user= | sort | uniq
105
daemon
lp
mysql
nagios
postfix
root
snmp
tom
www-data

If it would be useful to know in this case how many processes were being run by each of these users, you can include the -c option for uniq:

$ ps -eo user= | sort | uniq -c
      1 105
      1 daemon
      1 lp
      1 mysql
      1 nagios
      2 postfix
     78 root
      1 snmp
      7 tom
      5 www-data

You could even sort this output itself to show the users running the most processes first with sort -rn:

$ ps -eo user= | sort | uniq -c | sort -rn
     78 root
      7 tom
      5 www-data
      2 postfix
      1 snmp
      1 nagios
      1 mysql
      1 lp
      1 daemon
      1 105

Incidentally, if you’re not counting results and really do just want a list of unique users, you can leave out the uniq and just add the -u flag to sort:

$ ps -eo user= | sort -u
105
daemon
lp
mysql
nagios
postfix
root
snmp
tom
www-data

The above means that I seldom actually find myself using uniq with no options at all.

Unix as IDE: Files

This entry is part 2 of 7 in the series Unix as IDE.

One prominent feature of an IDE is a built-in system for managing files, both the elementary functions like moving, renaming, and deleting, and ones more specific to development, like compiling and checking syntax. It may also be useful to have operations on sets of files, such as finding files of a certain extension or size, or searching files for specific patterns. In this article, I’ll explore some useful ways to use tools that will be familiar to most GNU/Linux users for the purposes of working with sets of files in a project.

Listing files

ls is probably one of the first commands an administrator learns for getting a simple list of the contents of a directory. Most administrators will also know about the -a and -l switches, to show all files including dot files and to show more detailed data about files in columns, respectively.

There are other switches to GNU ls which are less frequently used, some of which turn out to be very useful for programming:

  • -t — List files in order of last modification date, newest first. This is useful for very large directories when you want to get a quick list of the most recent files changed, maybe piped through head or sed 10q. Probably most useful combined with -l. If you want the oldest files, you can add -r to reverse the list.
  • -X — Group files by extension; handy for polyglot code, to group header files and source files separately, or to separate source files from directories or build files.
  • -v — Naturally sort version numbers in filenames.
  • -S — Sort by filesize.
  • -R — List files recursively. This one is good combined with -l and piped through a pager like less.
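
For example, combining a few of the switches above:

$ ls -lt | head    # details of the most recently changed files
$ ls -ltr          # detailed listing, oldest first
$ ls -lSr          # detailed listing, smallest files first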

Since the listing is text like anything else, you could, for example, pipe the output of this command into a vim process, so you could add explanations of what each file is for and save it as an inventory file or add it to a README:

$ ls -XR | vim -

This kind of stuff can even be automated by make with a little work, which I’ll cover in another article later in the series.

Finding files

Funnily enough, you can get a complete list of files including relative paths with no decoration by simply typing find with a . argument for the current directory, though you may want to pipe it through sort:

$ find . | sort
.
./Makefile
./README
./build
./client.c
./client.h
./common.h
./project.c
./server.c
./server.h
./tests
./tests/suite1.pl
./tests/suite2.pl
./tests/suite3.pl
./tests/suite4.pl

If you want an ls -l style listing, you can add -ls as the action to apply to results in GNU find(1):

$ find . -ls | sort -k11,11
1155096    4 drwxr-xr-x   4 tom      tom          4096 Feb 10 09:37 .
1155152    4 drwxr-xr-x   2 tom      tom          4096 Feb 10 09:17 ./build
1155155    4 -rw-r--r--   1 tom      tom          2290 Jan 11 07:21 ./client.c
1155157    4 -rw-r--r--   1 tom      tom          1871 Jan 11 16:41 ./client.h
1155159   32 -rw-r--r--   1 tom      tom         30390 Jan 10 15:29 ./common.h
1155153   24 -rw-r--r--   1 tom      tom         21170 Jan 11 05:43 ./Makefile
1155154   16 -rw-r--r--   1 tom      tom         13966 Jan 14 07:39 ./project.c
1155080   28 -rw-r--r--   1 tom      tom         25840 Jan 15 22:28 ./README
1155156   32 -rw-r--r--   1 tom      tom         31124 Jan 11 02:34 ./server.c
1155158    4 -rw-r--r--   1 tom      tom          3599 Jan 16 05:27 ./server.h
1155160    4 drwxr-xr-x   2 tom      tom          4096 Feb 10 09:29 ./tests
1155161    4 -rw-r--r--   1 tom      tom           288 Jan 13 03:04 ./tests/suite1.pl
1155162    4 -rw-r--r--   1 tom      tom          1792 Jan 13 10:06 ./tests/suite2.pl
1155163    4 -rw-r--r--   1 tom      tom           112 Jan  9 23:42 ./tests/suite3.pl
1155164    4 -rw-r--r--   1 tom      tom           144 Jan 15 02:10 ./tests/suite4.pl

Note that in this case I have to specify to sort that it should sort by the 11th column of output, the filenames; this is done with the -k option.

find has a complex filtering syntax all of its own; the following examples show some of the most useful filters you can apply to retrieve lists of certain files:

  • find . -name '*.c' — Find files with names matching a shell-style pattern. Use -iname for a case-insensitive search.
  • find . -path '*test*' — Find files with paths matching a shell-style pattern. Use -ipath for a case-insensitive search.
  • find . -mtime -5 — Find files edited within the last five days. You can use +5 instead to find files edited before five days ago.
  • find . -newer server.c — Find files more recently modified than server.c.
  • find . -type d — Find directories. For files, use -type f; for symbolic links, use -type l.

Note, in particular, that all of these can be combined, for example to find C source files edited in the last two days:

$ find . -name '*.c' -mtime -2

By default, the action find takes for search results is simply to list them on standard output, but there are several other useful actions:

  • -ls — Provide an ls -l style listing, as above (GNU find(1))
  • -delete — Delete matching files
  • -exec — Run an arbitrary command line on each file, replacing {} with the appropriate filename, and terminated by \;; for example:

    $ find . -name '*.pl' -exec perl -c {} \;
    

    You can use + as the terminating character instead if you want to put all of the results on one invocation of the command. One trick I find myself using often is using find to generate a list of files that I then open for editing in a single Vim session:

    $ find . -name '*.c' -exec vim {} +
    

Earlier versions of Unix as IDE suggested the use of xargs with find results. In most cases this should not really be necessary, and it’s more robust to handle filenames with whitespace using -exec or a while read -r loop.
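
As a sketch of the latter approach, with wc -l standing in for whatever you actually want to run on each file:

find . -name '*.c' -print | while IFS= read -r file; do
    wc -l "$file"
done

This copes with spaces in filenames, though filenames containing newlines will still confuse it; -exec remains the most robust choice.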

Searching files

More often than by their attributes, however, you want to find files based on their contents, and it’s no surprise that grep, in particular grep -R, is useful here. The following searches the current directory tree recursively for anything matching ‘someVar’; the -F flag treats the pattern as a fixed string rather than a regular expression:

$ grep -FR someVar .

Don’t forget the case insensitivity flag either, since grep is case-sensitive by default:

$ grep -iR somevar .

Also, you can print a list of files that match without printing the matches themselves with grep -l:

$ grep -lR someVar .

If you write scripts or batch jobs using the output of the above, use a while loop with read to handle spaces and other special characters in filenames:

grep -lR someVar . | while IFS= read -r file; do
    head "$file"
done

If you’re using version control for your project, this often includes metadata in the .svn, .git, or .hg directories. This is dealt with easily enough by excluding (grep -v) anything matching an appropriate fixed (grep -F) string:

$ grep -R someVar . | grep -vF .svn

Some versions of grep include --exclude and --exclude-dir options, which may be tidier.
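
Where available, that approach looks like the below, and doesn’t risk discarding matching lines that merely happen to mention the excluded string:

$ grep -R --exclude-dir=.svn someVar .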

With all this said, there’s a very popular alternative to grep called ack, which excludes this sort of stuff for you by default. It also allows you to use Perl-compatible regular expressions (PCRE), which are a favourite for many programmers. It has a lot of utilities that are generally useful for working with source code, so while there’s nothing wrong with good old grep since you know it will always be there, if you can install ack I highly recommend it. There’s a Debian package called ack-grep, and being a Perl script it’s otherwise very simple to install.

Unix purists might be displeased with my even mentioning a relatively new Perl script alternative to classic grep, but I don’t believe that the Unix philosophy or using Unix as an IDE is dependent on sticking to the same classic tools when alternatives with the same spirit that solve new problems are available.

File metadata

The file tool gives you a one-line summary of what kind of file you’re looking at, based on its extension, headers and other cues. This is very handy used with find when examining a set of unfamiliar files:

$ find . -exec file {} +
.:            directory
./hanoi:      Perl script, ASCII text executable
./.hanoi.swp: Vim swap file, version 7.3
./factorial:  Perl script, ASCII text executable
./bits.c:     C source, ASCII text
./bits:       ELF 32-bit LSB executable, Intel 80386, version ...
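
If you want something more machine-readable, perhaps to drive a script, GNU file can print just the MIME type instead; a quick sketch:

$ file --mime-type bits.c
bits.c: text/x-c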

Matching files

As a final tip for this section, I’d suggest learning a bit about pattern matching and brace expansion in Bash, which you can do in my earlier post entitled Bash shell expansion.
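
As a taste of what’s there, brace expansion combines nicely with globs; using the example project from earlier, this lists the C source and header files together:

$ ls *.{c,h}
client.c  client.h  common.h  project.c  server.c  server.h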

All of the above make the classic UNIX shell into a pretty powerful means of managing files in programming projects.

Edited April 2017 to use POSIX-compatible examples for most of the find(1) invocations.

Elegant Awk usage

For many system administrators, Awk is used only as a way to print specific columns of data from programs that generate columnar output, such as netstat or ps. For example, to get a list of all the IP addresses and ports with open TCP connections on a machine, one might run the following:

# netstat -ant | awk '{print $5}'

This works pretty well, but along with the data you actually wanted, it also includes the fifth word of the opening explanatory note and the heading of the fifth column:

and
Address
0.0.0.0:*
205.188.17.70:443
172.20.0.236:5222
72.14.203.125:5222

There are varying ways to deal with this.

Matching patterns

One common way is to pipe the output further through a call to grep, perhaps to only include results with at least one number:

# netstat -ant | awk '{print $5}' | grep '[0-9]'

In this case, it’s instructive to use the awk call a bit more intelligently by setting a regular expression which the applicable line must match in order for that field to be printed, with the standard / characters as delimiters. This eliminates the need for the call to grep:

# netstat -ant | awk '/[0-9]/ {print $5}'

We can further refine this by ensuring that the regular expression only matches data in the fifth column of the output, using the ~ operator:

# netstat -ant | awk '$5 ~ /[0-9]/ {print $5}'

Skipping lines

Another approach you could take to strip the headers out might be to use sed to skip the first two lines of the output:

# netstat -ant | awk '{print $5}' | sed 1,2d

However, this can also be incorporated into the awk call, using the NR variable as part of a condition checking that the line number is greater than two:

# netstat -ant | awk 'NR>2 {print $5}'
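
The two approaches combine naturally if you want to be doubly sure of skipping the headers:

# netstat -ant | awk 'NR>2 && $5 ~ /[0-9]/ {print $5}'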

Combining and excluding patterns

Another common idiom on systems that don’t have the special pgrep command is to filter ps output for a string, but exclude the grep process itself from the output with grep -v grep:

# ps -ef | grep apache | grep -v grep | awk '{print $2}'

If you’re using Awk to get columnar data from the output, in this case the second column containing the process ID, both calls to grep can instead be incorporated into the awk call:

# ps -ef | awk '/apache/ && !/awk/ {print $2}'

Again, this can be further refined if necessary to ensure you’re only matching the expressions against the command name by specifying the field number for each comparison:

# ps -ef | awk '$8 ~ /apache/ && $8 !~ /awk/ {print $2}'

If you’re used to using Awk purely as a column filter, the above might help to increase its utility for you and allow you to write shorter and more efficient command lines. The Awk Primer on Wikibooks is a really good reference for using Awk to its fullest for the sorts of tasks for which it’s especially well-suited.
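
As one small taste of that, the sort and uniq -c counting idiom from earlier can be reproduced in a single Awk process with an associative array; the output order of for (u in n) is unspecified, so it still gets piped through sort -rn for ranking:

$ ps -eo user= | awk '{n[$1]++} END {for (u in n) print n[u], u}' | sort -rn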