awk, Linux utilities

You should get to know the Linux utilities out there; they are really valuable!

I already knew awk, but had never actually used it until recently. I had used sed, but not that much.

I didn’t see the real value of awk until I came across the following use case: a Word document to transform into CSV format! I had said it was impossible: you have to do it by hand because the document is unstructured!

Use the awk, Luke!

Two steps: first, save the document as plain text (it should be possible to work with RTF too). Then write a simple awk script, like this one:


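#!/usr/bin/awk -f
# Note: a multi-character RS needs gawk or mawk; POSIX awk only uses its first character.
BEGIN {
    cat = "categorie"
    FS = "\n"            # each line of a record is one field
    RS = "Definition"    # a new record starts at each "Definition"
    ORS = ""             # print appends nothing; the newline is written explicitly below
}
{
    # categorie, then the 6th and 7th lines of the record
    print cat, ",", $6, $7, ","

    # Skip ahead to the line after "Confirmation" (stop at the last field if it is missing)
    x = 9
    while (x <= NF && $x != "Confirmation") {
        x++
    }
    x++

    # Print the lines between Confirmation and Observations,
    # except those that begin with a number
    while (x <= NF && $x != "Observations:") {
        if ($x !~ /^[0-9]+/) {
            print $x, " "
        }
        x++
    }
    print "\n"
}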

Simple? Yes. The important part is in the BEGIN block: awk cuts its input into rows (records) and columns (fields). By default, it uses newlines to separate records and whitespace to separate fields. BUT you can change this: I tell it that the Field Separator is the newline “\n”, that the Record Separator is the word “Definition”, and that the Output Record Separator is empty. That is why I have to print a newline myself at the end of each record.
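To see the separators at work on something smaller, here is a minimal sketch. The sample text and the function name split_records are made up for illustration, and it assumes an awk that accepts a multi-character RS (gawk or mawk, for instance).

```shell
# Hypothetical sample: two records, each introduced by the word "Definition".
sample='Definition
apple
a fruit
Definition
carrot
a vegetable
'

# split_records is an illustrative name, not part of any tool.
split_records() {
    awk 'BEGIN { RS = "Definition"; FS = "\n"; OFS = "," }
         # $1 is the empty rest of the "Definition" line, so the
         # useful lines start at $2. Records that are too short
         # (such as the empty one before the first "Definition") are skipped.
         NF >= 3 { print $2, $3 }'
}

printf '%s' "$sample" | split_records
```

Because each record starts right after the word “Definition”, its first field is the empty remainder of that line. This is why the field numbers in a script like the one above are offset from what you might count by eye.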

My example may not be very explicit, but imagine that you can extract all the paragraphs after the title “Introduction”, or specific parts of a document. It works great if your Word doc contains tables.
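As a rough sketch of the “Introduction” idea, the snippet below prints only the first paragraph that follows a given heading. It assumes the plain-text export puts each heading on its own line and separates paragraphs with blank lines; the sample document and the helper name paragraph_after are invented for the example.

```shell
# Hypothetical plain-text export: headings and paragraphs separated by blank lines.
doc='Title

Introduction

This report covers awk.
It spans two lines.

Method

We used sed too.
'

# paragraph_after HEADING: print the paragraph that follows HEADING.
# (Illustrative helper name.)
paragraph_after() {
    awk -v heading="$1" '
        BEGIN { RS = "" }               # empty RS: paragraph mode, blank lines separate records
        found { print; exit }           # the record right after the heading
        $0 == heading { found = 1 }     # the heading is a record of its own
    '
}

printf '%s' "$doc" | paragraph_after "Introduction"
```

Setting RS to the empty string is standard awk behaviour, so this part does not depend on gawk. To print every paragraph up to the next heading instead, you would drop the exit and stop on the next heading you recognise.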

Here is a good reference: IBM’s “Common threads: Awk by example” by Daniel Robbins. Check part 2; it is really useful.
