du | grep and regular expressions

I love regular expressions. I’m not an expert, and I’ve never devoted much explicit effort to learning them, but little by little I’ve finally learned to use them. The key to this implicit learning is simple: if you have a problem that smells like regular expressions, then write one to solve it… Ah! Add to that the fact that the Eclipse search and replace dialog supports regexps, which comes in very handy for testing them.

du_grep_find1

It also has a nice autocomplete feature:

du_grep_find2

This is the story of one of those regexp cases!

I’m a hard disk “free space eater”, so I was recently running low on disk space. This is common on the computers I use: no matter how big my hard drives are, I always suffer from free space shortages. So I needed to check how big some files and directories were in a given path in order to wisely delete some of them.

Easy? Think again. In your comfy Windows, Mac or Linux with Gnome/KDE equipped with a fancy file manager it may be an easy task. But in my hard and weird Fluxbox the task is not as straight/direct/fast as you might think.

Now, don’t get me wrong: from the point of view of many Gnome, KDE or other desktop users out there, I am probably using the crappiest desktop environment ever made (in fact, it’s just a window manager, not a desktop environment), and I’ve been told so. But let me say this as loud as I can: I just love it. It’s lightweight (lightweight enough to be compared to E16… good old times), clean, simple, etc. It just doesn’t get in the way and does its job pretty well.

du_grep_fluxbox_th

I have not used Gnome for a long time, but with my Fluxbox install I don’t have to deal with menus, panels, and Gnome things that break and leave your session useless. The only things I need to survive are a few shortcuts to switch desktops, stick/unstick windows and launch terminals/shells, and I’m done. I’m one of those “give me a terminal and I’ll rule the world” kind of guys! You think that’s crazy? Well, I have a friend who is even crazier… he even removed window borders and title bars, and handled windows completely with shortcuts! His motives were simple: “If I remove borders and title bars I’ll gain 10 to 15 extra pixels of screen”, he told me once.

So, trying to get back to the regexps topic: I don’t have a file manager in Fluxbox, so I don’t have a “right button->properties” to click on directories to check disk usage. Even worse: I don’t have a fancy GUI available to check how much disk space is being used by all the files and subdirectories contained inside a given directory. Though, thinking about it, I believe Gnome and KDE users don’t have one either. And now that I think of it, I have something even better: I have the almighty “du”, combined with the powerful “grep” command, which by the way supports regexps!

For those of you who don’t know about it, “du” (I think the command name comes from “Disk Usage”) is a nice program used to check the amount of disk used by a given directory (or by the current working directory by default). It generates reports like the following one:

248     ./model
140     ./src/view
212     ./src
3260    ./lib
48      ./bin/view
112     ./bin
1024    ./Make
284     ./ldraw/p/48
1480    ./ldraw/p
4640    ./ldraw/parts/s
40848   ./ldraw/parts
42332   ./ldraw
60280   .

The number on the left is the disk space used by each directory in the tree, including everything stored below it. It is counted in blocks rather than bytes: 1024-byte blocks by default on GNU systems, so effectively kilobytes, which matches the human-readable report further down. The last line is the total for the directory du was run on: the space used by every subdirectory plus the space used by the files sitting directly in it.
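A quick way to convince yourself of how du adds things up is to build a tiny throwaway tree and look at the numbers. This is only a sketch: the file names and sizes are made up for the illustration, and it assumes a du that accepts -k and -s and a head that accepts -c, as the GNU and BSD versions do.

```shell
# Build a throwaway tree: one file at the top level, one inside a subdirectory.
# (Hypothetical names and sizes, chosen only for this illustration.)
tmp=$(mktemp -d)
mkdir "$tmp/sub"
head -c 20000 /dev/zero > "$tmp/top.bin"
head -c 50000 /dev/zero > "$tmp/sub/inner.bin"

# -k forces 1024-byte units, whatever the environment says about block sizes.
sub_kb=$(du -k "$tmp/sub" | cut -f1)
# -s prints a single summary line for the whole tree.
total_kb=$(du -sk "$tmp" | cut -f1)
echo "sub: ${sub_kb}K, total: ${total_kb}K"

rm -rf "$tmp"
```

The exact figures depend on the filesystem, but the total for the top directory should never be smaller than the figure for the subdirectory, because it already includes it.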

The previous “du” output is interesting, but it has two main disadvantages:

1) I don’t read raw numbers well, at least not when talking about disk space. I understand numbers in kilo, mega or giga bytes, but show me a big number with no unit attached and I just don’t get it. Guilty: maybe I can’t divide by 1000 (or by 1024 to be exact) as fast as you can.

2) I don’t care about the disk space used by each leaf. I’m only interested in the first level. I don’t want to know if I have to delete the ./ldraw/p/48 directory, I need to know if I have to delete the whole ldraw directory because it’s using too much disk space. And I don’t want to manually add up all the entries under the ldraw directory to find out what I need.

So, du has a few options that solve the stated problems. It’s possible to type:

du -h --max-depth=1

To get:

248K    ./model
212K    ./src
3.2M    ./lib
112K    ./bin
1K      ./Make
7.0M    ./.svn
42M     ./ldraw
59M     .

It’s much better now. Disk usage in each entry is reported in kilo, mega or giga bytes and the result only reports the disk usage under the top directories (but adding subdirectory usage), so you know that ldraw is the one here using too much disk space.

But what if you have a directory with a lot of subdirectories, where only a few of them use most of the disk space while the vast majority use just a few kilobytes? You are interested in locating the few entries that use most of the disk space (those that use megabytes or gigabytes) while ignoring those that use just a few kilobytes. That’s where grep enters the scene:

du -h --max-depth=1 | grep M

To get:

3.2M    ./lib
1K      ./Make
7.0M    ./.svn
42M     ./ldraw
59M     .

This way you will only get megabyte entries, but… the result is not accurate enough. That is exactly what I’d been doing for a long time… until yesterday, when I chose to use regular expressions to get a better result.

The problem with the previous grep is that it will keep only the lines with an M, any M, anywhere. For example, the Make folder gets included in the result, despite the fact that it only uses 1K of disk space. So, in some cases we’ll get entries using kilobytes being reported because the directory name has a capital M somewhere.
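The false positive is easy to reproduce without touching the disk. The sketch below feeds grep a few lines copied from the report above (du separates the size from the path with a tab, and printf recreates that here) instead of running du itself.

```shell
# A few lines copied from the du -h report above, tab-separated as du prints them.
report=$(printf '%s\t%s\n' \
    '248K' './model' \
    '3.2M' './lib' \
    '1K'   './Make' \
    '42M'  './ldraw')

# A bare "M" matches anywhere on the line, so ./Make slips through
# even though it only uses 1K.
loose=$(printf '%s\n' "$report" | grep M)
printf '%s\n' "$loose"
```

Three of the four lines survive the filter: the two real megabyte entries plus ./Make, which is only there because of the capital M in its name.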

How is this problem solved? With a little bit of regular expressions. First we realize that each line of the result matches a pattern (aren’t regular expressions used to match patterns, after all?): one or more digits, possibly with a dot among them, then a K, M or G letter (whichever one we are interested in) and finally whatever comes next: the spaces before the directory name, the directory name itself, etc.

That pattern can be “encoded” in a regular expression like this one (shown here for the K case):

"([[:digit:]]|\.)+(K)(.)+"

So, grep can be used as in the previous example to keep only the lines using “megabytes” or “gigabytes”:

du --max-depth=1 -h | grep -E "([[:digit:]]|\.)+(M)(.)+"

Or…

du --max-depth=1 -h | grep -E "([[:digit:]]|\.)+(G)(.)+"
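To see the difference with the plain “grep M”, the megabyte expression can be tried on lines copied from the earlier report. Again, this is just a sketch that uses printf to stand in for the real du output.

```shell
# Lines copied from the du -h report, tab-separated as du prints them.
report=$(printf '%s\t%s\n' \
    '248K' './model' \
    '3.2M' './lib' \
    '1K'   './Make' \
    '42M'  './ldraw')

# The M must now come right after a run of digits and dots,
# so the capital M in ./Make no longer counts.
megas=$(printf '%s\n' "$report" | grep -E "([[:digit:]]|\.)+(M)(.)+")
printf '%s\n' "$megas"
```

Only ./lib and ./ldraw remain, and the 1K Make entry is gone.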

The “-E” switch enables extended regular expressions. By default, grep uses basic regular expressions, where characters such as +, |, ( and ) are taken literally unless escaped with a backslash, at least in GNU grep (read the manual for the full list of differences XD), so without “-E” the previous regular expression will just not work.

Also, if you want to show only entries using megabytes OR gigabytes, you can change the regular expression to accept both Ms and Gs:
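A small experiment shows what “-E” changes. The sketch below runs the same simplified pattern (not the full expression from above) through grep twice, once in the default basic syntax and once in the extended one, against a single line copied from the report.

```shell
# One line copied from the du -h report (size, tab, path).
line=$(printf '42M\t./ldraw')

# Basic syntax: "+" is an ordinary character, so this looks for a literal "+M".
if printf '%s\n' "$line" | grep -q '[0-9]+M'; then basic=match; else basic=no-match; fi

# Extended syntax: "+" means "one or more of the previous item".
if printf '%s\n' "$line" | grep -qE '[0-9]+M'; then extended=match; else extended=no-match; fi

echo "basic: $basic, extended: $extended"
```

Without “-E” the line is rejected, because there is no literal plus sign in it. With “-E” it matches.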

du --max-depth=1 -h | grep -E "([[:digit:]]|\.)+(G|M)(.)+"
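One caveat: the expression is not tied to the beginning of the line, so a digit followed by a capital M or G inside a directory name can still sneak in. A possible tightening, which is my own variation and not part of the original recipe, is to anchor the pattern with ^ and require whitespace right after the unit letter. The ./mp3Music and ./videos entries below are hypothetical, added only to trigger the problem.

```shell
# Two lines from the report plus two hypothetical entries for the illustration.
report=$(printf '%s\t%s\n' \
    '3.2M' './lib' \
    '12K'  './mp3Music' \
    '1.5G' './videos' \
    '42M'  './ldraw')

# Original expression: the "3M" inside "mp3Music" is enough to match.
original=$(printf '%s\n' "$report" | grep -E "([[:digit:]]|\.)+(G|M)(.)+")

# Anchored variant (an assumption, not the article's pattern): the size must be
# the first thing on the line and must be followed by whitespace.
anchored=$(printf '%s\n' "$report" | grep -E "^[0-9.]+[GM][[:space:]]")

printf 'original:\n%s\n\nanchored:\n%s\n' "$original" "$anchored"
```

The original expression keeps all four lines, while the anchored one drops the 12K ./mp3Music entry and keeps the three that really use megabytes or gigabytes.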

If you’ve read this far, then you should have the feeling that regular expressions are something you should really know about. And like I said at the beginning, I believe it’s the kind of thing that you should learn little by little as you use it. Learn the basic concepts, the way they are used and what they are used for, and the next time you face the kind of problem where it would be useful to match / replace some string, you can try a regular expression approach.

At the beginning, there will be situations where it’ll take longer to solve a problem using regexps than to do it in any other traditional way. But you’ll surely be amazed by the elegance of the solutions. With enough time, you’ll learn to write better expressions and to do it very fast, so you will add a very powerful tool to your toolbox.

For example, since I started writing this article (which was not long ago), I’ve used regular expressions in at least two completely unrelated situations. The first one was to remove leading blank spaces from badly indented PHP code using vi:

:%s/^    //g

(Notice the four spaces between the ^ and the second /.)

The other was to do redirects and URL rewriting using mod_rewrite in a “.htaccess” file. Useful, isn’t it?
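By the way, the vi substitution shown above has a close cousin on the command line: sed understands the same s/pattern/replacement/ form. The sketch below is not what I actually ran. It applies the same expression to a couple of made-up PHP lines, just to show the effect.

```shell
# Hypothetical badly indented snippet: every line carries four extra leading spaces.
fixed=$(printf '%s\n' '    <?php' '        echo "hi";' | sed 's/^    //')

# Each line loses exactly four leading spaces; deeper indentation is preserved.
printf '%s\n' "$fixed"
```

Since ^ can only match once per line, the trailing g of the vi version is not needed here.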