Bitextor: the automatic bitext generator
What is Bitextor?
Bitextor is an automatic bitext generator that obtains its base
corpora from the Internet. It works by downloading an entire website
(applying a filter so that only files written in HTML are kept)
and comparing every pair of files. It detects the language of each
file and, through a set of heuristics (file size, HTML skeleton edit
distance, format, etc.), tries to guess which files have the
same content in different languages. Once it has identified the
pairs of files, it generates a bitext file in TMX format.
The aim of this application is to provide a simple,
semi-supervised way to obtain bilingual corpora for training
machine translation systems.
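To make the pairing heuristics described above more concrete, here is a
minimal Python sketch of two of them: the file size comparison and the
HTML skeleton edit distance. It is an illustration only and is not taken
from Bitextor's source code. The function names (skeleton, edit_distance,
looks_like_pair) and the thresholds (max_size_ratio, max_distance_ratio)
are assumptions chosen for the example, and language detection is left out.

```python
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Collects opening and closing tag names in order, ignoring the text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def skeleton(html):
    """Return the tag sequence ("skeleton") of an HTML document."""
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return parser.tags


def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def looks_like_pair(html_a, html_b, max_size_ratio=1.5, max_distance_ratio=0.2):
    """Guess whether two pages are translations of each other.

    Both thresholds are hypothetical values picked for this example.
    """
    size_a, size_b = len(html_a), len(html_b)
    if min(size_a, size_b) == 0:
        return False
    # Heuristic 1: translated pages tend to have similar sizes.
    if max(size_a, size_b) / min(size_a, size_b) > max_size_ratio:
        return False
    # Heuristic 2: translated pages tend to share the same tag structure.
    skel_a, skel_b = skeleton(html_a), skeleton(html_b)
    longest = max(len(skel_a), len(skel_b))
    if longest == 0:
        return False
    return edit_distance(skel_a, skel_b) / longest <= max_distance_ratio
```

Two pages whose markup is identical and whose text differs only in
language have a skeleton distance of zero and pass both checks, while a
page with a different layout is rejected by the second check. Comparing
tag sequences rather than raw text keeps the comparison independent of
the languages involved.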
Who develops Bitextor?
Bitextor is a project of the Department of Software and Computing
Systems (University of Alacant). There have been two different
versions of Bitextor. The initial version (1.0) was developed by
Enrique Sanchez Villamil. For its second version, Bitextor was
redesigned and rewritten from scratch, keeping only the original
idea. This second development was carried out by Miquel Esplà i
Gomis.
Both versions were supervised by Mikel Forcada Zubizarreta,
a member of the Transducens research team at
the University of Alacant.
Downloads
Version 3.0.0 of Bitextor was released on April 22, 2009
through the Bitextor SourceForge
web page. You can always follow development on the project's
SVN server.
Required packages/applications
Bitextor has only been tested on GNU/Linux systems:
- Debian 4.0.
- Ubuntu 8.10.
- Fedora 9.
- OpenSuse 11.0.
- Mandriva One Linux 2009.
To run it you will need all of the following libraries/applications:
- LibTRE (packages 'libtre4' and 'libtre-dev').
- LibTidy (packages 'libtidy-0.99-0' and 'libtidy-dev').
- LibTextCat (packages 'libtextcat0', 'libtextcat-data' and 'libtextcat-dev').
- LibXML2 (packages 'libxml2' and 'libxml2-dev').
- LibEnca (packages 'libenca0' and 'libenca-dev').
- LibTagAligner (package 'tag-aligner-3.1.0').
- Httrack (package 'httrack').
Building and Installing
- ./configure
- make
- make install
User documentation
User documentation is available on the project's wiki.
License
This application is released under the GNU General Public
License.
Last update: April 2009