StarDict Dictionary File Format

Format for StarDict dictionary files
Original: http://stardict.sourceforge.net/StarDictFileFormat
StarDict homepage: http://stardict.sourceforge.net/
StarDict on-line dictionary: http://www.stardict.org
Partially translated and annotated by Liu Jianwen (http://blog.csdn.net/keminlau)

kemin: To work at the low level, the file format is a hurdle that must be cleared. A quick web search turns up a few scattered, partial translations, but the source is a reference document rather than a tutorial: the information is complete but not very systematic, and it says little about how the format is designed or why. So I translated it briefly according to my own understanding and organized it into a table of contents, hoping to spark further discussion.

Contents

{0}. Number and byte-order conventions
{1}. Dictionary files
{2}. Format of the dictionary info file (".ifo")
{3}. Format of the index file (".idx")
{4}. Format of the synonym file (".syn")
{5}. Format of the offset cache file
{6}. Format of the collation file
{7}. Format of the dictionary data file (".dict")
{8}. Resource Storage
{9}. Tree Dictionary
{10}. More information

{0}. Number and Byte-order Conventions

When you record the numbers that identify sizes, offsets, etc., you should use 32-bit numbers, such as you might represent with a glong.
In order to make StarDict work on different platforms, these numbers must be in network byte order. You can ensure the correct byte order by using the g_htonl() function when creating dictionary files. Conversely, you should use g_ntohl() when reading dictionary files.
Strings should be encoded in UTF-8.
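
To make the convention concrete, here is a minimal sketch of writing and reading such a number. It uses the POSIX htonl()/ntohl() functions as stand-ins for glib's g_htonl()/g_ntohl() (an assumption: a POSIX environment; the document itself uses the glib calls).
============
/* Minimal sketch of the byte-order rule, using POSIX htonl()/ntohl() as
 * stand-ins for glib's g_htonl()/g_ntohl(). Assumes a POSIX environment. */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main(void)
{
    uint32_t offset = 123456;            /* host byte order */
    uint32_t on_disk = htonl(offset);    /* value as it should be written to the file */

    /* fwrite(&on_disk, 4, 1, idx_file) would store it in network byte order */

    uint32_t back = ntohl(on_disk);      /* convert back when reading the file */
    printf("host=%u network=0x%08x back=%u\n",
           (unsigned)offset, (unsigned)on_disk, (unsigned)back);
    return 0;
}
============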

{1}. Files

Every dictionary consists of these files:
(1). somedict.ifo (dictionary info file)
(2). somedict.idx or somedict.idx.gz (index file)
(3). somedict.dict or somedict.dict.dz (dictionary data file)
(4). somedict.syn (optional synonym file)

You can use gzip -9 to compress the .idx file.
If the .idx file is not compressed, loading is fast and memory is saved while the dictionary is in use; if it is compressed, the whole .idx file is loaded into memory, which makes querying faster.
You can use dictzip to compress the .dict file.
"dictzip" uses the same compression algorithm and file format as does gzip, but provides a table that can be used to randomly access compressed blocks in the file. The use of 50-64kB blocks for compression typically degrades compression by less than 10%, while maintaining acceptable random access capabilities for all data in the file. As an added benefit, files compressed with dictzip can be decompressed with gunzip.
For more information about dictzip, refer to DICT project, please see:
http://www.dict.org
When you create a dictionary, you should use .idx and .dict.dz in normal case.
StarDict will search for the .ifo file, then open the .idx or .idx.gz file and the .dict.dz or .dict file that are in the same directory and have the same base name.

{2}. The ".ifo" file's format.

The .ifo file has the following format:

StarDict's dict ifo file
version=2.4.2
[options]

Note that the current "version" string must be "2.4.2" or "3.0.0". If it's not, then StarDict will refuse to read the file. If version is "3.0.0", StarDict will parse the "idxoffsetbits" option.

[options]
---------
In the example above, [options] expands to any of the following lines specifying information about the dictionary. Each option is a keyword
followed by an equal sign, then the value of that option, then a newline. The options may appear in any order.

Note that the dictionary must have at least a bookname, a wordcount and an idxfilesize, or the load will fail. All other information is optional. All
strings should be encoded in UTF-8.
Available options:
bookname= // required
wordcount= // required
synwordcount= // required if ".syn" file exists.
idxfilesize= // required
idxoffsetbits= // New in 3.0.0
author=
email=
website=
description= // You can use <br> for new line.
date=
sametypesequence= // not required, but very important.

wordcount is the count of word entries in the .idx file; it must be correct.
idxfilesize is the size (in bytes) of the .idx file. Even if the .idx file is compressed to a .idx.gz file, this entry must record the original .idx file's size, and it
must be correct too. The .gz file doesn't contain its original size information, but knowing the original size can speed up extraction to memory, as you don't need to call realloc() many times.
idxoffsetbits can be 64 or 32. If "idxoffsetbits=64", the offset field of the .idx file will be 64 bits.
The "sametypesequence" option is described in further detail below.

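Since the .ifo file is plain UTF-8 text with one key=value pair per line, reading it is straightforward. Below is a minimal sketch of parsing one line; the helper name parse_ifo_line() is made up for illustration and is not StarDict's actual loader.
============
/* Minimal sketch of parsing one "key=value" line from a .ifo file.
 * parse_ifo_line() is an illustrative helper, not StarDict's real code. */
#include <stdio.h>
#include <string.h>

static void parse_ifo_line(const char *line)
{
    const char *eq = strchr(line, '=');
    if (!eq)
        return;                 /* skip the magic header line, blank lines, etc. */

    char key[64], value[256];
    size_t klen = (size_t)(eq - line);
    if (klen >= sizeof(key))
        return;
    memcpy(key, line, klen);
    key[klen] = '\0';
    snprintf(value, sizeof(value), "%s", eq + 1);
    value[strcspn(value, "\r\n")] = '\0';   /* strip the trailing newline */

    /* a real loader would now store bookname, wordcount, idxfilesize, ... */
    printf("option '%s' = '%s'\n", key, value);
}

int main(void)
{
    parse_ifo_line("StarDict's dict ifo file\n");
    parse_ifo_line("bookname=Some Dictionary\n");
    parse_ifo_line("wordcount=12345\n");
    return 0;
}
============
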
***
sametypesequence
You should first familiarize yourself with the .dict file format described in the next section so that you can understand what effect this option has on the .dict file.
If the sametypesequence option is set, it tells StarDict that each word's data in the .dict file will have the same sequence of datatypes.
In this case, we expect a .dict file that's been optimized in two ways: the type identifiers should be omitted, and the size marker for the last data entry of each word should be omitted.

Let's consider some concrete examples of the sametypesequence option.
Suppose that a dictionary records many .wav files, and so sets:
sametypesequence=W
In this case, each word's entry in the .dict file consists solely of a wav file. In the .dict file, you would leave out the 'W' character before each entry, and you would also omit the 32-bits integer at the front of each .wav entry that would normally give the entry's length.
You can do this since the length is known from the information in the idx file.

As another example, suppose a dictionary contains phonetic information and a meaning for each word. The sametypesequence option for this
dictionary would be:
sametypesequence=tm
Once again, you can omit the 't' and 'm' characters before each data entry in the .dict file. In addition, you should omit the terminating
'\0' for the 'm' entry for each word in the .dict file, as the length of the meaning string can be inferred from the length of the phonetic
string (still indicated by a terminating '\0') and the length of the entire word entry (listed in the .idx file).

So for cases where the last data entry for each word normally requires a terminating '\0' character, you should omit this character in the
dict file. And for cases where the last data entry for each word normally requires an initial 32-bits number giving the length of the
field (such as WAV and PNG entries), you must omit this number in the dictionary.

Every dictionary should try to use the sametypesequence feature to save disk space.
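
As an illustration of the optimization above, here is a hedged sketch of how a reader might split one word's data block when sametypesequence=tm: the 't' field keeps its terminating '\0', and the final 'm' field has no terminator, so its length is whatever remains of word_data_size (taken from the .idx file). The function name and error handling are illustrative only.
============
/* Sketch of splitting one word's data block when sametypesequence=tm.
 * 'data' and 'size' come from the .dict file at the offset/size recorded
 * in the .idx file; names and error handling are illustrative only. */
#include <stdio.h>
#include <string.h>

static void split_tm_entry(const char *data, size_t size)
{
    size_t t_len = strnlen(data, size);   /* the 't' field keeps its '\0' */
    if (t_len == size)
        return;                           /* malformed: no '\0' found */
    const char *phonetic = data;

    /* the final 'm' field has no terminator: its length is whatever remains */
    const char *meaning = data + t_len + 1;
    size_t m_len = size - t_len - 1;

    printf("phonetic: %s\n", phonetic);
    printf("meaning : %.*s\n", (int)m_len, meaning);
}

int main(void)
{
    /* fake entry: phonetic + '\0' + meaning, with no trailing '\0' */
    const char entry[] = "f@Un\0a spoken word";
    split_tm_entry(entry, sizeof(entry) - 1);
    return 0;
}
============
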
***

{3}. The ".idx" file's format.

The .idx file is just a word list. The word list is a sorted list of word entries.
Each entry in the word list contains three fields, one after the other:
word_str; // a utf-8 string terminated by '\0'.
word_data_offset; // word data's offset in .dict file
word_data_size; // word data's total size in .dict file


word_str gives the string representing this word. It's the string that is "looked up" by StarDict.
Two or more entries may have the same "word_str" with different word_data_offset and word_data_size. This may be useful for some dictionaries. But this feature is only well supported by StarDict-2.4.8 and newer.
The length of "word_str" should be less than 256. In other words, (strlen(word) < 256).
If the version is "3.0.0" and "idxoffsetbits=64", word_data_offset will be a 64-bit unsigned number in network byte order. Otherwise it will be 32 bits.
It is possible for different word_str values to have the same word_data_offset and word_data_size, so that multiple word indexes point to the same definition. But this is not recommended; when multiple words share the same definition, you may create a ".syn" file for them instead, see section {4} below.
The word list must be sorted by calling stardict_strcmp() on the "word_str" fields. If the word list order is wrong, StarDict will fail to function correctly!
word_data_size should be a 32-bit unsigned number in network byte order.


============
gint stardict_strcmp(const gchar *s1, const gchar *s2)
{
    gint a;
    a = g_ascii_strcasecmp(s1, s2);
    if (a == 0)
        return strcmp(s1, s2);
    else
        return a;
}
============
g_ascii_strcasecmp() is a glib function:
Unlike the BSD strcasecmp() function, this only recognizes standard ASCII letters and ignores the locale, treating all non-ASCII characters
as if they are not letters.

stardict_strcmp() works fine for English characters, but the sorting of other locales' characters is not so good; in that case, you can enable
the collation feature, see section {6}.
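
Putting the fields above together, here is a hedged sketch of walking a 32-bit .idx image that has already been read (uncompressed) into memory; ntohl() stands in for g_ntohl(), and the buffer and function names are illustrative only.
============
/* Sketch of walking an uncompressed 32-bit .idx image held in memory:
 * a '\0'-terminated utf-8 word, then two 32-bit network-order numbers.
 * Illustrative only; real code should bounds-check more carefully. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <arpa/inet.h>

static void walk_idx(const unsigned char *buf, size_t idxfilesize)
{
    const unsigned char *p = buf;
    const unsigned char *end = buf + idxfilesize;

    while (p < end) {
        const char *word = (const char *)p;      /* word_str */
        p += strlen(word) + 1;
        if (p + 8 > end)
            break;                               /* truncated entry */

        uint32_t offset, size;
        memcpy(&offset, p, 4);                   /* word_data_offset */
        memcpy(&size, p + 4, 4);                 /* word_data_size */
        p += 8;

        printf("%s @ %u (%u bytes)\n", word,
               (unsigned)ntohl(offset), (unsigned)ntohl(size));
    }
}
============
A "3.0.0" dictionary with "idxoffsetbits=64" would read an 8-byte offset instead of the 4-byte one shown here.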

{4}. The ".syn" file's format.

This file is optional; note that a tree dictionary doesn't need this file.
Only StarDict-2.4.8 and newer support this file.

The .syn file contains information about synonyms; that means, when you input a synonym, StarDict will look up the original word related to it.
The format is simple. Each item contains one string and a number:
synonym_word; // a utf-8 string terminated by '\0'.
original_word_index; // original word's index in .idx file.
Items follow one another with no separator.
When you input synonym_word, StarDict will search for original_word.


The length of "synonym_word" should be less than 256. In other words, (strlen(word) < 256).
original_word_index is a 32-bits unsigned number in network byte order.
Two or more items may have the same "synonym_word" with different original_word_index.
The items must be sorted by stardict_strcmp() with synonym_word.
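
A .syn image can be walked the same way as the .idx file; here is a small hedged sketch (illustrative names, with ntohl() again standing in for g_ntohl()).
============
/* Sketch of walking a .syn image held in memory: a '\0'-terminated
 * synonym_word, then a 32-bit network-order original_word_index.
 * Illustrative only. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <arpa/inet.h>

static void walk_syn(const unsigned char *buf, size_t filesize)
{
    const unsigned char *p = buf;
    const unsigned char *end = buf + filesize;

    while (p < end) {
        const char *synonym = (const char *)p;   /* synonym_word */
        p += strlen(synonym) + 1;
        if (p + 4 > end)
            break;                               /* truncated entry */

        uint32_t index;
        memcpy(&index, p, 4);                    /* original_word_index */
        p += 4;

        printf("%s -> .idx entry %u\n", synonym, (unsigned)ntohl(index));
    }
}
============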

{5}. The offset cache file's format.

StarDict-2.4.8 started to support cache files; this feature can speed up loading and save memory by mmap()ing the cache file. The cache file names
are .idx.oft and .syn.oft, and the format is:
First a utf-8 string terminated by '\0', then many 32-bits numbers forming the word-offset index. This index is sparse, with "ENTR_PER_PAGE=32",
and the numbers are not stored in network byte order.
The string must begin with:
=====
StarDict's oft file
version=2.4.8
=====
Then a line like this:
url=/usr/share/stardict/dic/stardict-somedict-2.4.2/somedict.idx
This line should have an ending '\n'.


StarDict will try to create the .oft file in the same directory as the .ifo file first; if that fails, it then tries to create it at
~/.cache/stardict/ (~/.cache is obtained from g_get_user_cache_dir()).
If two or more dictionaries have the same file name, StarDict will create somedict.idx.oft, somedict(2).idx.oft, somedict(3).idx.oft,
etc. for them respectively, each with a different "url=" in the beginning string.

{6}. The collation file's format.

StarDict-2.4.8 started to support collation, which sorts the word list by a collate function. It will create collation files named .idx.clt and .syn.clt; the format is a little like the offset cache file:
First a utf-8 string terminated by '\0', then many 32-bits numbers forming the index sorted by the collate function; they are not stored
in network byte order.
The string must begin with:
=====
StarDict's clt file
version=2.4.8
=====
Then two lines like this:
url=/usr/share/stardict/dic/stardict-somedict-2.4.2/somedict.idx
func=0
The second line should have an ending '\n' too.

StarDict support these collate functions currently:
typedef enum {
UTF8_GENERAL_CI = 0,
UTF8_UNICODE_CI,
UTF8_BIN,
UTF8_CZECH_CI,
UTF8_DANISH_CI,
UTF8_ESPERANTO_CI,
UTF8_ESTONIAN_CI,
UTF8_HUNGARIAN_CI,
UTF8_ICELANDIC_CI,
UTF8_LATVIAN_CI,
UTF8_LITHUANIAN_CI,
UTF8_PERSIAN_CI,
UTF8_POLISH_CI,
UTF8_ROMAN_CI,
UTF8_ROMANIAN_CI,
UTF8_SLOVAK_CI,
UTF8_SLOVENIAN_CI,
UTF8_SPANISH_CI,
UTF8_SPANISH2_CI,
UTF8_SWEDISH_CI,
UTF8_TURKISH_CI,
COLLATE_FUNC_NUMS
} CollateFunctions;
These UTF8_*_CI functions come from MySQL, in fact.

The file's location is determined just like the .oft file's.

Notice that for a "somedict.idx.gz" file, the corresponding collation file is somedict.idx.clt, not somedict.idx.gz.clt, and the
"url=" is somedict.idx, not somedict.idx.gz. So after you gzip the .idx file, StarDict needn't create the .clt file again.

{7}. The ".dict" file's format.

The .dict file is a pure data sequence, as the offset and size of each word is recorded in the corresponding .idx file.

If the "sametypesequence" option is not used in the .ifo file, then the .dict file has fields in the following order:
==============
word_1_data_1_type; // a single char identifying the data type
word_1_data_1_data; // the data
word_1_data_2_type;
word_1_data_2_data;
...... // the number of data entries for each word is determined by
// word_data_size in .idx file
word_2_data_1_type;
word_2_data_1_data;
......
==============
It's important to note that each field in each word indicates its own length, as described below. The number of possible fields per word is also not fixed, and is determined by simply reading data until you've read word_data_size bytes for that word.

Suppose the "sametypesequence" option is used in the .ifo file, and the option is set like this:
sametypesequence=tm
Then the .dict file will look like this:
==============
word_1_data_1_data
word_1_data_2_data
word_2_data_1_data
word_2_data_2_data
......
==============
The first data entry for each word will have a terminating '\0', but the second entry will not have a terminating '\0'. The omissions of
the type chars and of the last field's size information are the optimizations required by the "sametypesequence" option described
above.
If "idxoffsetbits=64", the file size of the .dict file will be bigger than 4G. Because we often need to mmap this large file, and there is
a 4G maximum virtual memory space limit in a process on the 32 bits computer, which will make we can get error, so "idxoffsetbits=64"
dictionary can't be loaded in 32 bits machine in fact, StarDict will simply print a warning in this case when loading. 64-bits computers
should haven't this limit.

Type identifiers
----------------
Here are the single-character type identifiers that may be used with the "sametypesequence" option in the .ifo file, or may appear in the
.dict file itself if the "sametypesequence" option is not used.

Lower-case characters signify that a field's size is determined by a terminating '\0', while upper-case characters indicate that the data
begins with a network byte-ordered guint32 that gives the size of the following data (NOT the whole field size, which is 4 bytes bigger).

'm'
Word's pure text meaning.
The data should be a utf-8 string ending with '\0'.

'l'
Word's pure text meaning.
The data is NOT a utf-8 string, but is instead a string in locale encoding, ending with '\0'. Sometimes using this type will save disk
space, but its use is discouraged.

'g'
A utf-8 string which is marked up with the Pango text markup language.
For more information about this markup language, See the "Pango Reference Manual."
You might have it installed locally at:
file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html

't'
English phonetic string.
The data should be a utf-8 string ending with '\0'.


'x'
A utf-8 string which is marked up with the xdxf language.
See http://xdxf.sourceforge.net
StarDict has these extensions:
<rref> can have "type" attribute, it can be "image", "sound", "video"
and "attach".
<kref> can have "k" attribute.

'y'
Chinese YinBiao or Japanese KANA.
The data should be a utf-8 string ending with '\0'.

'k'
KingSoft PowerWord's data. The data is a utf-8 string ending with '\0'.
It is in XML format.

'w'
MediaWiki markup language.
See http://meta.wikimedia.org/wiki/Help:Editing#The_wiki_markup

'h'
Html codes.

'r'
Resource file list.
The content can be:
img:pic/example.jpg // Image file
snd:apple.wav // Sound file
vdo:film.avi // Video file
att:file.bin // Attachment file
More than one line is supported as a list of available files.
StarDict will find the files in the Resource Storage.
The image will be shown, the sound file will have a play button.
You can "save as" the attachment file and so on.

'W'
wav file.
The data begins with a network byte-ordered guint32 to identify the wav file's size, immediately followed by the file's content.

'P'
Picture file.
The data begins with a network byte-ordered guint32 to identify the picture file's size, immediately followed by the file's content.

'X'
This type identifier is reserved for experimental extensions.
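
Putting the type rules together, here is a hedged sketch of walking one word's data block when sametypesequence is not set: each field starts with its type character, lower-case fields run up to a terminating '\0', and upper-case fields start with a network-order guint32 size. The function is illustrative only, not StarDict's real parser.
============
/* Sketch of walking one word's data block in a .dict file when
 * sametypesequence is NOT set. Illustrative only; real code should
 * bounds-check the upper-case size prefix against the block as well. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <ctype.h>
#include <arpa/inet.h>

static void walk_dict_entry(const unsigned char *data, size_t word_data_size)
{
    size_t pos = 0;

    while (pos < word_data_size) {
        char type = (char)data[pos++];           /* 'm', 't', 'W', 'P', ... */
        size_t field_size;

        if (isupper((unsigned char)type)) {
            /* upper-case: a guint32 size prefix, then that many bytes */
            if (pos + 4 > word_data_size)
                break;
            uint32_t n;
            memcpy(&n, data + pos, 4);
            field_size = 4 + ntohl(n);
        } else {
            /* lower-case: data runs up to (and includes) a terminating '\0' */
            field_size = strnlen((const char *)data + pos,
                                 word_data_size - pos) + 1;
        }

        printf("field '%c': %zu data bytes\n", type,
               isupper((unsigned char)type) ? field_size - 4 : field_size - 1);
        pos += field_size;
    }
}
============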


{8}. Resource Storage

Resource Storage stores the external files referenced by the 'r' resource file list, by images in html code, and by image, media and other files in wiki tags.
It has two forms:
1. Direct directories and files in the "res" sub-directory.
2. The res.rifo, res.ridx and res.rdic database.
Direct files may have file-name encoding problems, as Linux uses UTF-8 and Windows uses the local encoding, so you'd better use only ASCII file names, or use the database to store UTF-8 file names.
The database may need to extract a file (such as a .wav file) to a temporary file, so it is not as efficient as direct files. But the database has the
advantage of compression.
You can convert between the res directory and the res database with the dir2resdatabase and resdatabase2dir tools.
StarDict will try to load the storage database first, then try the direct files form.

The format of the res.rifo file:
StarDict's storage ifo file
version=3.0.0
filecount= // required.
idxoffsetbits= // optional.


The format of the res.ridx file:
filename; // A string end with '\0'.
offset; // 32 or 64 bits unsigned number in network byte order.
size; // 32 bits unsigned number in network byte order.
filename can include a path too, such as "pic/example.png". filename is case sensitive, and no two entries should have the same
filename.
If "idxoffsetbits=64", then offset is 64 bits.
These three fields are repeated for each entry.
The entries are sorted by the strcmp() function on the filename field.
It is possible that different filenames have the same offset and size.

The format of the res.rdic file:
It is just the concatenation of all the resource files.
You can dictzip this file as res.rdic.dz.



{9}. Tree Dictionary

The tree dictionary support is used for information viewing, etc.

A tree dictionary contains three files: sometreedict.ifo, sometreedict.tdx.gz and sometreedict.dict.dz.

It is better to compress the .tdx file, as it is always loaded into memory.

The .ifo file has the following format:

StarDict's treedict ifo file
version=2.4.2
[options]

Available options:

bookname= // required
tdxfilesize= // required
wordcount=
author=
email=
website=
description=
date=
sametypesequence=

wordcount is only used for the info view in the dict manage dialog, so it is not important in a tree dictionary.

The .tdx file is just the word list.
-----------
The word list is a tree list of word entries.

Each entry in the word list contains four fields, one after the other:
word_str; // a utf-8 string terminated by '\0'.
word_data_offset; // word data's offset in .dict file
word_data_size; // word data's total size in .dict file. it can be 0.
word_subentry_count; //how many sub word this entry has, 0 means none.

Subentries immediately follow their parent entry. This makes the order the same as a tree list with all its nodes expanded, read from top to bottom.

word_data_offset, word_data_size and word_subentry_count should be 32-bits unsigned numbers in network byte order.
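
As a sketch of the entry layout (struct and function names are made up for illustration), one .tdx entry could be read like this; since the list is stored in pre-order, the next word_subentry_count entries read after an entry are its children.
============
/* Sketch of reading one .tdx entry: a '\0'-terminated word, then three
 * 32-bit network-order numbers. Struct and function names are illustrative. */
#include <string.h>
#include <stdint.h>
#include <arpa/inet.h>

struct tdx_entry {
    const char *word;           /* word_str */
    uint32_t    offset;         /* word_data_offset */
    uint32_t    size;           /* word_data_size, can be 0 */
    uint32_t    subentries;     /* word_subentry_count, 0 means a leaf */
};

/* returns a pointer just past the entry, or NULL if the buffer is truncated */
static const unsigned char *read_tdx_entry(const unsigned char *p,
                                           const unsigned char *end,
                                           struct tdx_entry *out)
{
    out->word = (const char *)p;
    p += strlen(out->word) + 1;
    if (p + 12 > end)
        return NULL;

    uint32_t v[3];
    memcpy(v, p, 12);
    out->offset     = ntohl(v[0]);
    out->size       = ntohl(v[1]);
    out->subentries = ntohl(v[2]);
    return p + 12;
}
============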

The .dict file's format is the same as the normal dictionary.


{10}. More information.

You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and "src/tools/*.cpp" for more information.

After you have built a dictionary, you can use "stardict_verify" to verify the dictionary files. You can find it at "src/tools/".

If you have any questions, email me. :)

Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's English.

Hu Zheng <huzheng_001@163.com>
http://forlinux.yeah.net
2007.4.24