It is commonly believed that the processing of text documents requires knowledge of a high-level programming language. A decade or so ago it was considered proper to have good knowledge of the Perl language, while today a specialist in the field “absolutely must” have mastery of Python. But is this knowledge really indispensable?
In this post I will show that even sophisticated tasks in text processing can be completed quickly and easily without knowledge of a single command in any programming language.
Let’s begin with the simplest task: checking how many words are contained in a given text document. This can be done using the following sequence of instructions:
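As a minimal sketch of what such a sequence might look like on the Linux command line (the file name document.txt and its sample contents are assumptions made purely for illustration):

```shell
# Hypothetical sample document, created here only so the sketch is self-contained.
workdir=$(mktemp -d)
printf 'the quick brown fox\njumps over the lazy dog\n' > "$workdir/document.txt"

# cat reads the file; wc -w counts the whitespace-separated words it receives.
# tr -d ' ' strips the padding that some versions of wc put around the number.
word_count=$(cat "$workdir/document.txt" | wc -w | tr -d ' ')
echo "$word_count"
```

The vertical bar passes the output of one command to the next, so the file is read first and counted second.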
If the result is to be stored (saved in some file), the sequence of instructions may take the form:
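One plausible command-line form of this, sketched under the assumption that the input is called document.txt and the result should go to word_count.txt (both names are hypothetical):

```shell
# Hypothetical sample document, created here only so the sketch is self-contained.
workdir=$(mktemp -d)
printf 'the quick brown fox\njumps over the lazy dog\n' > "$workdir/document.txt"

# The > operator sends the output of wc to a file instead of the screen.
cat "$workdir/document.txt" | wc -w > "$workdir/word_count.txt"

# Display the stored result to confirm that it was saved.
cat "$workdir/word_count.txt"
```

Note that > overwrites an existing file of the same name; >> would append to it instead.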
Instructions like these are available in the Linux operating system. To obtain the desired result, you need to enter the appropriate command names, which have to be learnt. However, I can recommend something much more attractive: a graphical program in which, instead of entering hard-to-memorise commands, you simply move the pieces of a jigsaw.
Go to the page https://s416072.students.wmi.amu.edu.pl/, and you will see a screen like this:
Among the more than 20 jigsaw pieces on display, drag those that you want to use onto the bar at the bottom. For example, to count the words in a document:
Note that the first piece (Read File) has a protrusion on the right-hand side only, which means that another element is expected only on that side. The shape of the second piece suggests that other pieces should appear on both sides, whereas the third (Save To File) can appear only at the end of the jigsaw.
Which document do we want to count the words in? Bash Box provides four options:
If you select the file containing information on young offenders, the content of that document will be displayed on the screen:
The first row contains the information that in the year 2010 there were 741 female young offenders aged 17. Subsequent rows contain similar information for different ages and sexes.
To count the number of words in this document, we “translate” the sequence of instructions from the jigsaw into a corresponding Linux command. This is done using the button highlighted below:
A special environment, popularly called the “black window”, is available in the Linux system for issuing commands. This is the Bash shell. On translation, our sequence of instructions takes the following form in the Bash shell:
This code looks somewhat complicated, but let’s try to analyse it.
In the window at the side you can see the result of the whole pipeline or sequence of commands:
This tells us that the document on young offenders contains a total of 819 words.
Bash Box also lets you play at text processing by solving puzzles in the style of stories from westerns. To do this, use the following button at the top left of the screen:
Let’s solve the first puzzle:
The format of the document from the database mentioned in the story is displayed alongside the puzzle:
To obtain the required answer, we have to perform the following steps:
We drag suitable jigsaw pieces to the bar on the screen:
Now we check whether the answer obtained is correct:
The Check button turns orange, which means that the answer given was wrong! We therefore drag the third piece back to its original place, replacing it with another piece:
Now the Check button turns green, which means we can happily move on to the next of the eight puzzles.
If you were to ask a computer science student how to obtain a frequency list of all words appearing in the complete works of Shakespeare, for example, they would probably answer that it would be simplest to write suitable code in the Python programming language. There is a much easier way of doing it, however – by means of a single Bash shell command:
cat sh.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c > sh_frequency_list
Assume that the text file sh.txt contains Shakespeare’s complete works. Then:
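Each stage of the pipeline does one small job. The sketch below runs the same stages on a one-line sample so that the effect is easy to inspect; the file names sample.txt and frequency_list are assumptions standing in for sh.txt and sh_frequency_list.

```shell
# Hypothetical one-line sample, used in place of the complete works.
workdir=$(mktemp -d)
printf 'To be, or not to be: that is the question.\n' > "$workdir/sample.txt"

# cat                   : read the file and pass its contents on
# tr -sc 'A-Za-z' '\n'  : turn every run of non-letters into a single newline,
#                         which leaves one word per line
# sort                  : bring identical words next to each other
# uniq -c               : collapse adjacent duplicates, prefixing each with its count
# > frequency_list      : store the result in a file
cat "$workdir/sample.txt" | tr -sc 'A-Za-z' '\n' | sort | uniq -c > "$workdir/frequency_list"

# Show the resulting list of counts and words.
cat "$workdir/frequency_list"
```

Because the command does not change letter case, "To" and "to" are counted as different words. The list comes out in the order produced by sort, not ranked by frequency; piping the output of uniq -c through sort -rn would put the most frequent words first.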
Processing text documents in Linux is extremely effective. With a single command you can achieve the same results as you could with complicated programs. The only difficulty to overcome is learning the available commands and understanding how to arrange them in a pipeline.
This is certainly worth doing, however – and here the Bash Box platform can be of enormous help!