This week I learned about state machines in AWK.

The AWK book gives an example of parsing multiline records with a simple state machine.

Consider a simple example, again an address list, but this time each record begins with a header that indicates some characteristic, such as occupation, of the person whose name follows, and each record (except possi- bly the last) is terminated by a trailer consisting of a blank line:

accountant
Adam Smith
1234 Wall St., Apt. SC
New York, NY 10021

doctor - ophthalmologist
Dr. Will Seymour
798 Maple Blvd.
Berkeley Heights, NJ 07922

lawyer
David w. Copperfield
221 Dickens Lane
Monterey, CA 93940

doctor - pediatrician
Dr. Susan Mark
600 Mountain Avenue
Murray Hill, NJ 07974

To print the doctor records without headers, we can use

/^doctor/ { p = 1; next }
p == 1
/^$/ { p = 0; next }

Pretty neat.

I wrote a small program to convert an html file into plain text in just a few lines of AWK using this technique.