
Thursday, 17 May 2012

Automatically analyze text with these simple command-line tools


Takeaway: Marco Fioretti suggests some free tools you can use to check and improve your own writing style as well as analyze other types of text. Here are some tips on how to use them.
The world of Free Software is full of little-known programs that are completely irrelevant for many users, but lifesavers for others. If only those other users knew that such programs exist! This thought is what led me to the topic of this post.
Computers surely aren’t better than humans at recognizing which texts are well written and which ones are not. Text analysis software, however, can detect many trivial, but very common errors and bad practices automatically.

Enter diction and style

Automatic style checking using only Free Software is possible with two almost unknown GNU tools, old but still available as binary packages for the most popular distributions: diction and style. The first program detects verbose and commonly misused phrases; the second tries to evaluate readability. At the moment, they officially support only English, German, and Dutch. However, several features will work, at least approximately, in any language.
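Installing them usually takes one command. On Debian or Ubuntu, for instance, both programs ship in a single package (the package name may vary on other distributions):

  # Debian/Ubuntu: one package provides both diction and style
  sudo apt-get install diction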
What can you do with GNU style and diction? Well, you can:
  • check for wordy, trite, clichéd, or misused phrases in a text
  • check for double words
  • measure the readability of a text, according to several formulas
  • calculate other style-related statistics, like the number of questions and imperatives in a text
The man pages of diction and style explain their options well, and several recipes in the Linux Cookbook already show with basic examples how to use these programs at the prompt.
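For a first taste, just point them at any plain text file (report.txt here is only a placeholder name):

  diction -s report.txt   # flag misused phrases and print suggested replacements
  style report.txt        # print readability grades and sentence statistics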

The benefits of “large scale” text analysis

If good introductions to those programs already exist, what is the point of this post?
Easy: it is to highlight what is, in my opinion, the most important and least-known point about the GNU style and diction utilities: how powerful these programs become, and how much time they can save you, when you connect them with other software.
The output of GNU style and diction can (at least) provide some helpful instruction for those trying to improve their writing style.
Within the limits I already mentioned, teachers (and head-hunters…) may save time with style-checking software in another, even more important way: first-level evaluation. If a computer finds that your students repeatedly make the same grammar errors, you will realize, without effort, what you should explain again!
In the same way, finding which applicants for a PR job can't write sentences with fewer than 70 words will tell you who you should interview last. Even political activists may use text analysis software to find which candidates speak to their voters more clearly. The possibilities are endless.

How to analyze many documents automatically

First of all, being traditional Unix programs, style and diction can work in a pipe with other programs. This makes it very easy to write scripts that find all the text files in a directory, calculate the readability index of each file, and then find its average value for each topic or user.
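As a sketch of the idea (the file names and the awk step are mine, not part of the tools), this loop averages the Fog Index over every .txt file under the current directory:

  # collect the Fog Index of each file, then average the values with awk
  find . -type f -name "*.txt" | while read -r FILE
  do
    style "$FILE" | grep Fog | cut -d: -f2
  done | awk '{ sum += $1; n++ } END { if (n) printf "average Fog Index: %.2f\n", sum/n }'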
In the second place, while diction and style only work on plain text, almost every text-based document can be automatically converted to that format on Linux. You have to evaluate a bunch of papers in MS Office or OpenDocument formats? No problem! Unoconv can feed their plain text version to style or diction. This is how you would get the numeric value of the Fog Index for all the .odt files in the current directory:
  # read file names one per line, so names containing spaces are handled correctly
  find . -type f -name "*.odt" | while read -r FILE
  do
    echo -n "$FILE:   "
    unoconv --stdout "$FILE" | style | grep Fog | cut -d: -f2
  done
Doing the same thing with PDF files is equally easy (unless they have multi-column or irregular layouts) with converters like pdftotext. You may also automatically analyze single Web pages, or entire websites, in this way! In the first case, you should pass those pages to style or diction through a command-line browser like w3m. Want some fun? Type this at a command prompt:
w3m -dump http://www.gnu.org/gnu/manifesto.html | diction -s
And watch a GNU program like diction suggest (-s) better wording for… the GNU Manifesto.
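PDFs work the same way: pdftotext writes to standard output when you give it "-" as the output file name (report.pdf is, again, just a placeholder):

  pdftotext report.pdf - | style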
Automatic analysis of whole websites could be performed in two steps: first, mirror them on your hard drive with wget, then run a loop that uses w3m and style or diction as needed.
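Here is one possible sketch of that two-step approach, with www.example.com standing in for the real site:

  # step 1: mirror the website to the local disk
  wget --mirror --no-parent http://www.example.com/
  # step 2: convert each mirrored page to plain text and score it
  find www.example.com -name "*.html" | while read -r PAGE
  do
    echo -n "$PAGE:   "
    w3m -dump "$PAGE" | style | grep Fog | cut -d: -f2
  done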
The third and final reason why at least one of these two programs is much more powerful than you may suspect is how easy it is to extend it. Look at the first lines of the diction database:
  # cat /usr/share/diction/en | more
   a considerable amount of   much
   a large number of  many
   a lot of   Often obsolete, should sometimes be replaced by "many"
   a majority of  most
   a man who
   a matter of concern    (cliche, avoid)
See what I mean? Although undocumented, the format is simple enough that it's no problem to create your own database and load it instead of the standard one: one entry per line, write a word or phrase you want to avoid, followed by one tab character, followed by a suggestion. You can make your own style guide.
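For example, you could build a tiny database of your own pet peeves and, as the diction man page documents, load it with -f (adding -n to ignore the standard database). The file name and phrases below are only illustrations:

  # one entry per line: phrase, one TAB character, suggestion
  printf 'going forward\tin the future\n'  > mystyle.txt
  printf 'utilize\tuse\n'                 >> mystyle.txt
  # check a document against your phrase list only
  diction -n -f mystyle.txt report.txt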