
NLP in Linux

September 19, 2021

The easiest way to process texts

It is commonly believed that the processing of text documents requires knowledge of a high-level programming language. A decade or so ago it was considered proper to have good knowledge of the Perl language, while today a specialist in the field “absolutely must” have mastery of Python. But is this knowledge really indispensable?

In this post I will show that even sophisticated tasks in text processing can be completed quickly and easily without knowledge of a single command in any programming language.

How many words are there in my text?

Let’s begin with the simplest task: checking how many words are contained in a given text document. This can be done using the following sequence of instructions:

  • read data from the document (text file);
  • count the number of words in the retrieved data.

If the result is to be stored (saved in some file), the sequence of instructions may take the form:

  • read data from the text file;
  • count the number of words in the retrieved data;
  • save the result to a file.

The Bash Box platform

There are instructions like this available in the Linux operating system. To obtain the desired result, you need to enter the appropriate command names, which have to be learnt. However, I can recommend something much more attractive: a graphical program in which, instead of entering hard-to-memorise commands, you can simply move pieces of a jigsaw.

Go to the page https://s416072.students.wmi.amu.edu.pl/, and you will see a screen like this:

Among the more than 20 jigsaw pieces on display, drag those that you want to use onto the bar at the bottom. For example, to count the words in a document:

Note that the first piece (Read File) has a protrusion on the right-hand side only, which means that another element is expected only on that side. The shape of the second piece suggests that other pieces should appear on both sides, whereas the third (Save To File) can appear only at the end of the jigsaw.

In what document do we want to do a word count? Bash Box provides four options:

If you select the file containing information on young offenders, the content of that document will be displayed on the screen:

The first row contains the information that in the year 2010 there were 741 female young offenders aged 17. Subsequent rows contain similar information for different ages and sexes.

To count the number of words in this document, we “translate” the sequence of instructions from the jigsaw into a corresponding Linux command. This is done using the button highlighted below:

A special environment, popularly called the “black window”, is available in the Linux system for issuing commands. This is the terminal, in which the Bash shell runs. On translation, our sequence of instructions takes the following form in the Bash shell:

:~$ cat filename.txt | wc -w > finalfile.txt

This code looks somewhat complicated, but let’s try to analyse it.

  • :~$ is the prompt, which the Bash shell displays automatically to indicate that it is ready to accept a command from the user;
  • cat filename.txt is a command that reads the file called filename.txt and passes its contents on for further processing;
  • | (the pipe symbol) connects two commands: the output of the command on its left becomes the input of the command on its right;
  • wc -w is a command that counts the number of words in its input;
  • > finalfile.txt is a redirection: instead of being displayed on the screen, the result is saved in a file called finalfile.txt (an existing file of that name is overwritten).

In the window at the side you can see the result of the whole pipeline or sequence of commands:

This tells us that the document on young offenders contains a total of 819 words.
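
The same pipeline can be tried outside Bash Box in any Linux terminal. The following is a minimal sketch, assuming a small made-up sample file in place of the young offenders document (whose contents are not reproduced here). The file names filename.txt and finalfile.txt come from the explanation above; the temporary directory is my own addition, there only to avoid overwriting existing files.

```shell
#!/usr/bin/env bash
set -eu

# Work in a throwaway directory so that no existing files are overwritten.
workdir="$(mktemp -d)"
cd "$workdir"

# Hypothetical sample document: three short lines, nine words in total.
printf '%s\n' \
  'the quick brown fox' \
  'jumps over' \
  'the lazy dog' > filename.txt

# Read the file, count its words, and save the count to a file.
cat filename.txt | wc -w > finalfile.txt

# Display the stored result.
cat finalfile.txt
```

The sample contains nine words, so finalfile.txt ends up holding the number 9 (some systems pad it with leading spaces). Since wc can also read a file directly, wc -w < filename.txt gives the same count without cat.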

Puzzles

Bash Box also lets you play at text processing by solving puzzles in the style of stories from westerns. To do this, use the following button at the top left of the screen:

Let’s solve the first puzzle:

The format of the document from the database mentioned in the story is displayed alongside the puzzle:

To obtain the required answer, we have to perform the following steps:

  • read data from the file;
  • order the data by increasing value of the first column;
  • display the twenty initial data points (the first twenty database rows);
  • save the result to a file.

We drag suitable jigsaw pieces to the bar on the screen:

Now we check whether the answer obtained is correct:

The Check button turns orange, which means that the answer given was wrong! We therefore drag the third piece back to its original place, replacing it with another piece:

Now the Check button turns green, which means we can happily move on to the next of the eight puzzles.
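
For readers who would like to see the four steps listed for the first puzzle as a Bash command, here is a rough sketch. It rests on several assumptions: the file names data.txt and result.txt are made up, the sample data is invented, and the first column is taken to be numeric (hence sort -n). The article does not say which piece replaced the third one, so the sketch follows the steps as listed and is not necessarily the answer that the puzzle accepts.

```shell
#!/usr/bin/env bash
set -eu

# Work in a throwaway directory so that no existing files are overwritten.
workdir="$(mktemp -d)"
cd "$workdir"

# Hypothetical database file: 30 rows of the form "number name",
# deliberately stored out of numeric order.
seq 1 30 | awk '{ print $1, "row" $1 }' | sort -r > data.txt

# Read the file, sort numerically by the first column (increasing),
# keep the first twenty rows, and save them to a file.
cat data.txt | sort -n -k1,1 | head -n 20 > result.txt

# Display the stored result.
cat result.txt
```

Replacing head -n 20 with tail -n 20 would keep the last twenty rows instead, and adding -r to the sort command would reverse the order.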

An example use of Bash shell commands

If you were to ask a computer science student how to obtain a frequency list of all words appearing in the complete works of Shakespeare, for example, they would probably answer that it would be simplest to write suitable code in the Python programming language. There is a much easier way of doing it, however – by means of a single Bash shell command:

cat sh.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c > sh_frequency_list

Assume that the text file sh.txt contains Shakespeare’s complete works. Then:

  • cat sh.txt reads the data from that file for further processing;
  • tr -sc 'A-Za-z' '\n' replaces every character that is not a letter (such as a space or a punctuation mark) with a newline character, and squeezes each run of such characters into a single newline, so that each word appears on a separate line;
  • sort arranges all the lines (single words in this case) in alphabetical order (so that when the same words appear they will be placed together on neighbouring lines);
  • uniq -c collapses each group of identical neighbouring lines (those containing the same word) into a single line, prefixed with the number of times that line occurred;
  • the resulting frequency list is written to the file sh_frequency_list.
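
The pipeline can be tested on a small sample before it is run on a large file. The sketch below uses two invented lines as a stand-in for sh.txt and adds two steps that are not part of the command above, so treat them as optional extensions: tr 'A-Z' 'a-z' folds upper case into lower case (otherwise “To” and “to” are counted separately), and a final sort -nr orders the list by decreasing frequency. Setting LC_ALL=C is my own precaution to make sorting behave the same on every system.

```shell
#!/usr/bin/env bash
set -eu

# Use the plain C locale so that tr and sort behave identically everywhere.
export LC_ALL=C

# Work in a throwaway directory so that no existing files are overwritten.
workdir="$(mktemp -d)"
cd "$workdir"

# Hypothetical stand-in for the complete works: two short lines.
printf '%s\n' \
  'To be, or not to be, that is the question:' \
  'Whether tis nobler in the mind to suffer' > sh.txt

# Lower-case the text, put one word on each line, group identical words,
# count each group, and sort the counts from highest to lowest.
cat sh.txt | tr 'A-Z' 'a-z' | tr -sc 'a-z' '\n' | sort | uniq -c | sort -nr > sh_frequency_list

# Display the most frequent words.
head -n 3 sh_frequency_list
```

In this sample the word “to” occurs three times and therefore heads the list, followed by “be” and “the” with two occurrences each.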

Summary

Processing text documents in Linux is extremely effective. With a single pipeline of commands you can often achieve the same results as you could with a much longer program. The only difficulty to overcome is learning the available commands and understanding how to arrange them in a pipeline.

This is certainly worth doing, however – and here the Bash Box platform can be of enormous help!
