Note: this information is outdated. Please visit the project wiki.

Bitextor: the automatic bitext generator

What is Bitextor?

Bitextor is an automatic bitext generator which obtains its base corpora from the Internet. It works by downloading an entire website (applying a filter to keep only files written in HTML) and comparing every pair of files. It detects the language of each file and, through a group of heuristics (file size, HTML skeleton edit distance, format, etc.), tries to guess which files have the same content in different languages. Once it has identified the pairs of files, it generates a bitext file in TMX format.
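
To make the heuristics above more concrete, here is a minimal Python sketch of two of them: the file size comparison and the HTML skeleton edit distance. It is an illustration only, not Bitextor's actual code. The function names (html_skeleton, edit_distance, looks_like_pair) and the threshold values (max_size_ratio, max_distance_ratio) are assumptions made for this example, and language detection is left out.

```python
from html.parser import HTMLParser


class SkeletonParser(HTMLParser):
    """Collects the sequence of opening and closing tag names, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def html_skeleton(html):
    """Return the list of tags of an HTML document (its 'skeleton')."""
    parser = SkeletonParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def looks_like_pair(html_a, html_b, max_size_ratio=1.5, max_distance_ratio=0.2):
    """Guess whether two pages may be translations of each other.

    Both thresholds are hypothetical values chosen for this sketch.
    """
    size_a, size_b = len(html_a), len(html_b)
    if min(size_a, size_b) == 0:
        return False
    # Heuristic 1: translated pages tend to have similar sizes.
    if max(size_a, size_b) / min(size_a, size_b) > max_size_ratio:
        return False
    # Heuristic 2: translated pages tend to share the same tag structure.
    skel_a, skel_b = html_skeleton(html_a), html_skeleton(html_b)
    longest = max(len(skel_a), len(skel_b))
    if longest == 0:
        return False
    return edit_distance(skel_a, skel_b) / longest <= max_distance_ratio
```

Two pages that share the same markup and differ only in their text have a skeleton distance of zero, so they would be accepted as a candidate pair. A page with a clearly different tag structure or a very different size would be rejected.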

The objective of this application is to provide a simple way to obtain bilingual corpora in a semi-supervised way to train machine translation systems or to obtain translation memories to support computer-aided translation.

Who develops Bitextor?

Bitextor is a project of the Department of Software and Computing Systems (University of Alacant). There have been two different versions of Bitextor. The initial version (1.0) was developed by Enrique Sanchez Villamil. For its second version, Bitextor was redesigned and rewritten from scratch, keeping only the original idea. This second version was developed by Miquel Esplà i Gomis.

Both versions were supervised by Mikel L. Forcada, a member of the Transducens research team at the University of Alacant.


Version 3.0.0 of Bitextor was released on April 22, 2009, through the Bitextor SourceForge web page. You can follow development on the project's SVN server.

Dependencies and requirements

Bitextor has only been tested on GNU/Linux systems. To run it, you will need all of the following libraries and applications:

Building and installing

User documentation

You can find user documentation on the project's wiki.


This application is released under the GNU General Public License.

