Thursday, November 17, 2016

StarDict as an open-source toolchain for working with dictionaries

http://filosofie.unibuc.ro/~solcan/wt/gnu/s/stardict.html

The first snippet after a “StarDict” search on Google sounds like a commercial message: “The best dictionary program in Linux and Windows”. Under it, in fine print, there is a more qualified message: “The best free dictionary in Linux and Windows”. The article about StarDict in Wikipedia is more sober and informative, but not very comprehensive. 
In fact, StarDict is a toolchain for working with dictionaries. You can compile and decompile dictionaries, and use them with a GUI shell for searches and much more. There is even a commandline version of the program. I find the open source toolchain far more important than the “power” of the dictionary shell alone. 
I will try in this note to share my experience with StarDict. This experience - in the first half of January 2010 - is limited to the Fedora distribution of Linux and the 2.4.5 and 3.0.1 versions of the StarDict program. “Share” does not mean that I wrote a friendly tutorial. I just described the experience with StarDict. 
Please note two things: these are working notes (no claim of clarity!); you must be familiar with GNU/Linux, regular expressions, the Vim programming editor and other things connected with programming. Reading a text like this is not enough. You must experiment with the programs on a real computer. If you are just a computer user, these lines will be of little help and you might just end up in total confusion. 
[18 February 2010: for further observations concerning StarDict, see also the notes on the use of DEX online with StarDict under GNU/Linux and Windows.]

Installation

Some years ago I did run a “configure, make and install” procedure for StarDict, but since Fedora 6 I have just used the RPMs provided by the distribution. Thus I have nothing special to note about the initial part of the installation process. 
Can you start the program automatically when you log in? Yes, you can. Under Gnome I have added a file called StarDict.desktop to the directory
~/.config/autostart
This is a usual desktop file, i.e. a text file that you can create with an editor like Vim. The screenshot shows its content in my case. 
stardict-desktop
The content of the desktop file
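The screenshot is not reproduced here, so the following is only a minimal sketch of what such an autostart entry could contain. The keys come from the freedesktop.org Desktop Entry format. The values (Name=StarDict, Exec=stardict) and the helper names are assumptions, not a copy of the file in the screenshot.

```python
from pathlib import Path


def desktop_entry(name, command):
    """Return the text of a minimal autostart desktop entry."""
    lines = [
        "[Desktop Entry]",
        "Type=Application",
        "Name=" + name,      # assumed display name
        "Exec=" + command,   # assumed command used to start the program
    ]
    return "\n".join(lines) + "\n"


def autostart_path(home, filename="StarDict.desktop"):
    """Return the location of the entry inside ~/.config/autostart."""
    return Path(home) / ".config" / "autostart" / filename


if __name__ == "__main__":
    # Print instead of writing, so that nothing is changed on disk.
    print(autostart_path(Path.home()))
    print(desktop_entry("StarDict", "stardict"), end="")
```

Writing the returned text to the returned path (after creating the autostart folder, if needed) should be enough for a test under Gnome.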
If you do not want to use the more intricate machinery of the rpm command to find out where the files installed by StarDict are placed, you can just use File Roller (or some other archive manager capable of showing these files). 
With root privileges (usually when you install from RPMs) the dictionaries go to the folder:
/usr/share/stardict/dic
You may also use/create in your home a folder
~/.stardict/dic
For example, I put in this folder the dictionaries that I create myself. I find this location more convenient on a computer that is just for my personal use. 

Using StarDict

For the moment, I will just suppose that you have installed at least one dictionary. For example, under Fedora 10 one can install
stardict-dic-en-2.4.2-5.fc10.noarch.rpm

The Preferences

You can start the program (for example, by clicking the icon in Accessories) and you will see the main window of StarDict. In the top-right corner you will see the familiar “home” icon: a little house. There you can access the Main menu. Then you can read the StarDict Manual. In my 3.0.1 version of the program, the manual is still for version 2.4.2, but it is quite useful for a start. I will not repeat what is written there. I will concentrate on preferences and the management of dictionaries and plugins, as they are available in the 3.0.1 version. 
The button for the preferences dialogue is the last one on the bottom right corner. In the Dictionary menu you find a series of options. 
I left a tick on Dictionary/Cache/Create cache files. The Sort word list by collation function is useful for languages such as my native Romanian. Without it the program does not know how to order the words according to the Romanian alphabet. It does not make sense, however, to activate the collation function for Romanian when you use other dictionaries. 
Dictionary/Export means that you can copy the content of the dictionary article to a (text) file. Check the location of the text file! 
Dictionary/Sound has an option for Use TTS program. This means that I can put there a line like this:
espeak -v ro %s &
This means that the eSpeak speech synthesizer will pronounce the word in Romanian (ro). Check the available voices in a terminal with:
espeak --voices
For current use I disable the Dictionary/Sound option. 
The Network/Net Dic option has been the subject of discussions concerning security. I do not enable network dictionaries. You can read more about the risks on the Internet. First, there is the risk of sending sensitive content from the clipboard on the net. Second, you certainly have to trust the server that you use when you enable this option. 
The Main window/Search website option has an obvious meaning. For example, I use this option to go to Wikipedia articles. The key is the website search link:
http://ro.wikipedia.org/wiki/%s
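The %s in the link is a placeholder for the word that is looked up. The following is a minimal sketch of that substitution, not the code of StarDict itself. The function name is hypothetical, and the percent-encoding of the word is my assumption about what a browser needs.

```python
from urllib.parse import quote


def search_url(template, word):
    """Replace the %s placeholder with the percent-encoded word."""
    return template.replace("%s", quote(word))


if __name__ == "__main__":
    print(search_url("http://ro.wikipedia.org/wiki/%s", "StarDict"))
```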

The Management of the Plugins

On the top-right corner, in the Main Menu, one can find the Manage Plugins dialogue. You can enable/disable and configure the plugins. In the figure you can see an example for the configuration of the Spell Check plugin. 
stardict-spell
The dialogue for spell check configuration
What is the meaning of that ro for the language? How does StarDict understand it? 
StarDict uses Enchant. You can control the behavior of Enchant with the .enchant configuration file, placed in your home. From the figure, the syntax of this configuration file should be obvious. 
stardict-enchant
The Enchant configuration file
How can I know that the plugins have been loaded without errors? Open StarDict from a terminal and read the messages. Again, the figure should give an idea about what kind of messages you can get. 
stardict-messages
Stardict messages
When you try to use a non-existent language you get an error message. 
Now, when you install StarDict, pay attention to the dependencies. For example, the installer would probably require Enchant, but you may also need the enchant-aspell-1.4.2-4.fc10.i386.rpm or some similar file, because you want to use Aspell. 
A very interesting plugin is the one for WordNet dict rendering. It can be configured in graphic mode. 
stardict-wordnet
StarDict as WordNet browser
Most important are the plugins for the Data Parsing Engine. I left all of them enabled. Of course, you may disable some of them, but you really must know what you are doing. They affect the look of the Definition area. 

The management of the dictionaries

Where can I get dictionaries? Go to the StarDict page and download dictionaries. 
I find the *.tar.bz2 archives very convenient. As shown in the figure, I use Midnight Commander (mc) for the installation of these dictionaries. 
stardict-mc
Use mc for the installation of dictionaries
In the right panel of mc the archive is opened. In the left panel an appropriate folder has been created. Then use mc to copy from the right panel to the left one. Pay attention to the attributes of the files! 
The new dictionary appears in the list of the dictionaries when you restart StarDict. 
What happens if I want to use two versions of the WordNet dictionary, for example? Hack the .ifo file! Change the name of the dictionary. For example, put:
bookname=WordNet2
Is this important? Not that much, but it helps you when you manage the dictionaries. 
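The change can of course be made by hand in any editor. As a small illustration, here is a sketch that rewrites the bookname line of an .ifo text. The function name is hypothetical and the sample values (word count, index size) are invented for the example.

```python
def rename_dictionary(ifo_text, new_name):
    """Return the .ifo text with the bookname line replaced."""
    out = []
    for line in ifo_text.splitlines():
        if line.startswith("bookname="):
            line = "bookname=" + new_name
        out.append(line)
    return "\n".join(out) + "\n"


# Invented sample; a real .ifo file has its own values.
SAMPLE_IFO = (
    "StarDict's dict ifo file\n"
    "version=2.4.2\n"
    "wordcount=3\n"
    "idxfilesize=57\n"
    "bookname=WordNet\n"
)

if __name__ == "__main__":
    print(rename_dictionary(SAMPLE_IFO, "WordNet2"), end="")
```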
How do I manage the dictionaries? First you must look for the Manage Dictionaries button in the bottom-right corner of the main window. Then, of course, you have to open the dialogue box. The essential panel is Manage Dict. 
You can group the dictionaries. Click the line labelled Default Group in order to select it. Press the button with + on it (the add button). In the dialogue box that shows up, write the name of the new group. Press OK. Now, select the line with Query Dict on it. Press the add button. Select a dictionary from the list. Then repeat the operation on the Scan Dict line. 
What is the difference between real dictionaries and virtual dictionaries? Virtual dictionaries are created by plugins using commands from the GNU/Linux system. 
In the figure, you can see the result of the above operations for a group called Romanian. 
stardict-manage-dict
The Romanian group of dictionaries
Of course, in order to get spelling suggestions for Romanian you have to configure the spell check plugin as shown above. 
Is pressing the Delete button a tragic event? Not really. You do not erase the dictionary from the list. You just erase it from the group. You can put it back. 
In the main window, on the left side, under the icon with a broom you find four buttons (five, if you have installed a tree dictionary). You can use the last button to choose a dictionary group. [Pay attention to security problems if you press Enable Net Dic; especially, do not log in to an untrusted site.]

The format of the files

Now, let's move a bit towards the workshop where you can forge StarDict dictionaries. For this, one must understand a little about the format of the files used by StarDict. 
The format of the files is described in the documentation available on the site of the StarDict project. There are three essential files for the dictionaries. The ifo file contains information about the dictionary, such as the number of words (of articles) or the name of the dictionary. The dict.dz file contains the articles of the dictionary. The idx file contains a sorted list of entries. From version 2.4.8 on, there might be several entries containing the same word(s), but corresponding to different definitions. This is useful for dictionaries with several definitions for the same word (like DEX online, the online dictionary of the Romanian language, for example; see also the notes on the use of DEX online with StarDict under GNU/Linux and Windows). 
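To make the role of the idx file more concrete, here is a minimal sketch based on my reading of the format documentation: each index entry is a UTF-8 headword ended by a zero byte, followed by two 32-bit numbers in network byte order (the offset and the size of the article in the uncompressed dict file). The sketch assumes 32-bit offsets only. The function names and the sample data are invented for the illustration.

```python
import struct


def build_idx(entries):
    """Pack (word, offset, size) tuples into .idx bytes (32-bit offsets)."""
    chunks = []
    for word, offset, size in entries:
        chunks.append(word.encode("utf-8") + b"\x00")
        chunks.append(struct.pack(">II", offset, size))
    return b"".join(chunks)


def parse_idx(data):
    """Unpack .idx bytes into a list of (word, offset, size) tuples."""
    entries = []
    pos = 0
    while pos < len(data):
        end = data.index(b"\x00", pos)      # end of the headword
        word = data[pos:end].decode("utf-8")
        offset, size = struct.unpack(">II", data[end + 1:end + 9])
        entries.append((word, offset, size))
        pos = end + 9                       # zero byte + two 4-byte fields
    return entries


def article(dict_data, offset, size):
    """Cut one article out of the uncompressed dict data."""
    return dict_data[offset:offset + size].decode("utf-8")


if __name__ == "__main__":
    first = "first meaning".encode("utf-8")
    second = "second meaning".encode("utf-8")
    # The same headword appears twice, pointing to different definitions.
    index = build_idx([("casă", 0, len(first)),
                       ("casă", len(first), len(second))])
    for word, offset, size in parse_idx(index):
        print(word, "->", article(first + second, offset, size))
```

The two entries with the same headword show the case mentioned above: one word, several definitions.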

The dict files

In fact, the dict.dz file is a compressed dict file. 
The dict files are described on the site of the DICT project. 
For the work on a StarDict dictionary you are going to need a tool from the DICT project: dictzip. This is a program for compressing and uncompressing dictionaries. The sources have been created by the DICT group, and various GNU/Linux distributions provide compiled packages. 
Under Fedora, I have used an RPM for the installation of dictzip. When the tool is called dictzip you need the -d option on the commandline for decompression. 
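The dictzip format is compatible with gzip, so a compressed dictionary can also be read with ordinary gzip tools. The following is a minimal sketch using the gzip module of Python. The function name and the file name are hypothetical.

```python
import gzip


def read_dict_dz(path):
    """Return the uncompressed bytes of a .dict.dz (gzip-compatible) file."""
    with gzip.open(path, "rb") as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical file name; use the path of a real dictionary.
    data = read_dict_dz("sample.dict.dz")
    print(len(data), "bytes after decompression")
```

This reads the whole file at once. It does not use the random access that dictzip adds on top of gzip.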

The dictionaries and the StarDict editor

On the Internet, when it comes to software like StarDict, people seem to look for programs with a lot of dictionaries. This is the case of StarDict, but dictionaries are not like manna. They do not fall from heaven. You need tools to build them. 
I think that the most precious thing is to have an open way of building the dictionaries. StarDict has a set of stardict-tools. I have examined the 2.4.8 and the 3.0.1 versions of the tools. I will refer mainly to the latter version. 
A set of 35 tools might look very frightening. Many might think that this is not for them to try. In fact, the stardict-editor, which comes with the whole set, is very friendly. It has a graphical interface, it is easy to use and fast. 
Under Fedora 10, I had a problem with the compilation. There is no official rpm of the stardict-tools and one must compile from the sources. However, one has to patch the sources, because the gcc compiler does not accept them. 
Some Linux distributions do include the stardict-tools and they have patches for the sources. For example, the Arch Linux repository has a patch for stardict-tools. 
For Windows there are binaries of the stardict-editor. I have not tested them however. 
The stardict-editor includes “a simple UTF-8 text file editor”. In fact, I did not use the edit function of the tool. Instead, I have used Vim. You can use of course any text editor. Avoid however WYSIWYG monsters! 
Now, what you really need to use is the compile function from the stardict-editor. I will show in a simple example how easy it is to use the compiler for StarDict dictionaries. 

A very simple example

HanDeDict is a Chinese-German Dictionary. Its license is a German version of Creative Commons. You can download HanDeDict in EDICT format. 
Now, I will describe my recipe. It is not difficult to adapt it. The “cooking-time” is very short. 
First, one has to identify - in the downloaded archive - the file which is encoded in UTF-8 (because this is the encoding used by StarDict). Then you have to open it in an editor (I use Vim) and study it a bit. 
stardict-dedict-src
HanDeDict opened in Vim
The structure of each line is the following: (1) first is the entry in traditional Chinese script; (2) then the entry in simplified Chinese; (3) the pronunciation (using tone numbers) - enclosed in square brackets; (4) the meanings explained in German - enclosed in slashes. Spaces separate each structural element from the next one. 
The stardict-editor can use as a source a “Tab file”. This is a text file in which each dictionary article is written on a single line and the lexicon entry is separated by a tab from the rest. There are no empty lines, of course. 
Now, the structure of the HanDeDict file is very convenient. One has to replace the first space on each line with a tab. In the figure, one can see how this is done for the line with “Apollon”. The regular expression used for the substitution is at the bottom. The tab is clearly indicated by Vim. 
After a successful test, one can put a percent in front of the s, on the bottom line, and make the substitution in the whole file. This is the source that one needs for the stardict-editor. 
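The regular expression from the figure is not reproduced here. A Vim substitution like :s/ /\t/ would have the described effect, and the same step can be sketched outside Vim. In the sketch below the function names are hypothetical and the sample line is only an illustration written in the structure described above, not a line copied from HanDeDict.

```python
def to_tab_line(line):
    """Replace the first space of an EDICT-style line with a tab."""
    return line.replace(" ", "\t", 1)


def to_tab_lines(lines):
    """Convert many lines, dropping empty ones (a Tab file has none)."""
    return [to_tab_line(line.rstrip("\n")) for line in lines if line.strip()]


if __name__ == "__main__":
    sample = "謝謝 谢谢 [xie4 xie5] /danke/"   # illustrative line
    print(to_tab_line(sample))
```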
I have also prepared a source with the lexicon entry in the simplified Chinese script. For this, one also has to swap the structural elements (1) and (2). This is very easy (using Vim!). 
stardict-dedict-simple
HanDeDict source for the stardict-editor with simplified Chinese first
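The swap of the elements (1) and (2) can be sketched in the same way. This is only an illustration under the assumption that the traditional form should stay in the article, right after the tab. The function name is hypothetical and the sample line is not copied from HanDeDict.

```python
def simplified_first(line):
    """Put the simplified form first, as the lexicon entry before the tab."""
    traditional, simplified, rest = line.split(" ", 2)
    return simplified + "\t" + traditional + " " + rest


if __name__ == "__main__":
    print(simplified_first("謝謝 谢谢 [xie4 xie5] /danke/"))
```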
In the next step one calls the stardict-editor and compiles the sources. This is a very simple operation, as shown in the figure. 
stardict-editor
Compilation with the stardict-editor
The warning that we see in the figure is caused by the copyright notice on the first line of the HanDeDict file. All the other lines are OK and one can install the dictionary. 
I prefer an installation in the home (in the .stardict/dic folder). It is easy to create there a folder and put the dict.dz, idx and ifo files of the dictionary in it. Then one has to open StarDict and use the dictionary. 
stardict-dedict
The HanDeDict dictionary in StarDict
In the figure one can see how, using the scan facility of StarDict, one can easily find the meaning of the Chinese words. In the figure we use as an example the Wikipedia article about StarDict. 
Adding Gucharmap as a virtual dictionary is very useful. One can get the codes for the Chinese characters. I also enjoy the better visibility of the characters with Gucharmap. With bitmap fonts it's better than the magnifier. 
You can also add sound to the dictionary. I will only sketch a solution. 
In the Preferences dialogue enable Use TTS program, as shown in the figure. 
stardict-sound
Enable sound
The external program cnplay is a small piece of software written according to the following scheme: it takes as input a string (in the case of StarDict the selected text); the string is cleaned and then split into a list (of syllables written in pinyin with tone numbers); finally, another external program is used for playing audio files (each containing the pronunciation of a syllable). 
stardict-audio
Play a string of Chinese syllables
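The scheme can be sketched as follows. This is not the source of cnplay. The regular expression, the naming of the audio files (one file per syllable, with an .ogg extension) and the player command called play are all assumptions made for the illustration.

```python
import re
from pathlib import Path

# A syllable: letters (ü may be written as "u:") followed by a tone number.
SYLLABLE = re.compile(r"[a-zü:]+[1-5]")


def split_syllables(text):
    """Clean the input and return the pinyin syllables with tone numbers."""
    return SYLLABLE.findall(text.lower())


def audio_files(text, audio_dir, extension=".ogg"):
    """Map each syllable to an assumed audio file name."""
    return [Path(audio_dir) / (s + extension) for s in split_syllables(text)]


def player_commands(text, audio_dir, player="play"):
    """Build one command line per syllable for an assumed external player."""
    return [[player, str(path)] for path in audio_files(text, audio_dir)]


if __name__ == "__main__":
    for command in player_commands("[xie4 xie5]", "/tmp/pinyin"):
        print(" ".join(command))   # printed only; nothing is played here
```

Each command list could be passed to subprocess.run in order to play the syllables one after the other.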
I will also add a few remarks about the searches. 
One can search with StarDict in the whole HanDeDict. For example, let's say that I have heard the Chinese saying xie4 xie5. It is possible to look in the whole HanDeDict for this sequence in pinyin (and find that it means thank you). 
Now that I know the character for thanks, I can use a regular expression and find out all the entries which begin with this character. Even more interesting, I can find all the entries which contain in some position this character. 
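The two kinds of search can be illustrated on the Tab source itself, outside StarDict. The sketch below uses the regular expressions of Python, not the query syntax of StarDict. The function name is hypothetical and the three sample lines are invented.

```python
import re


def headwords_matching(pattern, tab_lines):
    """Return the lexicon entries (text before the tab) matching a pattern."""
    rx = re.compile(pattern)
    found = []
    for line in tab_lines:
        headword = line.split("\t", 1)[0]
        if rx.search(headword):
            found.append(headword)
    return found


# Invented sample lines in Tab format.
SAMPLE = [
    "谢谢\t[xie4 xie5] /danke/",
    "感谢\t[gan3 xie4] /Dank/",
    "你好\t[ni3 hao3] /hallo/",
]

if __name__ == "__main__":
    print(headwords_matching("^谢", SAMPLE))   # entries beginning with 谢
    print(headwords_matching("谢", SAMPLE))    # entries containing 谢 anywhere
```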
Summing up, with a few simple procedures I can satisfy my curiosity and find out the meaning of some of the mysterious (for me, of course!) Chinese characters in a Wikipedia article or elsewhere. 

The commandline tool

There is a commandline version of StarDict. The name of this tool is “sdcv”. It has a home page. 
The version that I have examined is 0.4.2. As in the case of the stardict-tools, there are problems with the compilation with newer versions of gcc. One can read a discussion on AUR about patching the sdcv source. 
The sdcv tool works with dictionaries of the older 2.4.2 type. 
The tool is very useful when you want to test the compiled dictionaries without installing them. It has an option for the path of the database. 
stardict-sdcv
A search on the commandline
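A test of a freshly compiled dictionary can also be scripted. The following is a minimal sketch, assuming that the option for the path is spelled --data-dir and that --non-interactive is available, as in the help text of the sdcv versions I know; check sdcv --help on your system. The function names and the example folder are hypothetical.

```python
import subprocess


def sdcv_command(data_dir, word):
    """Build the argument list for a single non-interactive lookup."""
    return ["sdcv", "--non-interactive", "--data-dir", str(data_dir), word]


def lookup(data_dir, word):
    """Run sdcv and return what it prints (requires sdcv to be installed)."""
    result = subprocess.run(sdcv_command(data_dir, word),
                            capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical folder holding the compiled ifo, idx and dict.dz files.
    print(" ".join(sdcv_command("/tmp/handedict", "谢谢")))
```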

The versions of StarDict and their authors

StarDict has a rather long story. 
At this time, the project leader is Hu Zheng. Evgeniy A. Dushistov and Hu Zheng developed sdcv. Alex Murygin has contributed to the project. There is also a long list of people who have translated the menus of StarDict into various languages. 
According to Wikipedia, StarDict evolved from the program StarDic by Ma Su-an. 
Version 2.4.2 of StarDict was a milestone. It was released in 2003. It had a new dictionary format. The version 2.4.5 (included in Fedora 6) was released in 2005. 
The plugin system was introduced in StarDict-3.0.0 (RedHat). 
Version 3.0.1 (included in Fedora 10) was released in 2007. See the ChangeLog file in 
/usr/share/doc/stardict-3.0.1/
for a history of the changes of the StarDict program. 
In the same folder you can find the COPYING file (the GPL license). 
While 2.4.5 was a shell for dictionaries in text-mode, 3.0.1 has the capability to use mark-up languages. 

A collection of all the files of the program (including older versions) can be found on sourceforge.net. 