Creating a terminology list from your existing translations

If you did not create a terminology list when you started your translation project, or if you have inherited some old translations, you probably want to create one now.

A terminology list, or glossary, is a list of words and phrases with their expected translations. It is useful for ensuring that your translations are consistent across your project.

Your existing translations already contain an embedded list of valid translations. This example will help you extract those terms. It is only a first step: you will need to review the terms and must not regard the result as a complete list. And of course you will want to feed your corrections back into the original translations.

Quick Overview

This describes a multi-stage process for extracting terminology from translation files. It is provided for historical interest and completeness, but you will probably find that using poterminology is easier and will give better results than following this process.

  • Filter out phrases of more than N words

  • Remove obviously erroneous phrases such as numbers and punctuation

  • Create a single PO compendium

  • Extract and review items that are fuzzy and drop untranslated items

  • Create a new PO file and convert it into CSV and TMX formats

Get short phrases from the current translations

We will not be able to identify terminology within bodies of text, so we only extract short pieces of text, i.e. ones that are between 1 and 3 words long.

pogrep --header --search=msgid -e '^\w+(\s+\w+){0,2}$' zulu zulu-short

We use --header to ensure that the PO files have a header entry (which is important for encoding). We search only in the msgid, and the regular expression matches strings of between 1 and 3 words. We search through the folder zulu and write the results to zulu-short.
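To see which strings the expression keeps, it can be tried outside pogrep. The following is a minimal sketch that assumes the expression behaves the same way under Python's re module; the helper name is_short_phrase and the sample strings are hypothetical and serve only as illustration.

```python
import re

# The same expression that is passed to pogrep with -e: one word,
# optionally followed by up to two more whitespace-separated words.
SHORT_PHRASE = re.compile(r"^\w+(\s+\w+){0,2}$")


def is_short_phrase(msgid):
    """Return True if msgid consists of one to three words and nothing else."""
    return SHORT_PHRASE.search(msgid) is not None


# Hypothetical msgid strings, not taken from any real PO file.
for text in ["File", "Save As", "Open recent file",
             "Open the recent file", "Save As..."]:
    print(repr(text), "kept" if is_short_phrase(text) else "dropped")
```

Note that \w also matches the underscore, so an entry such as _File would be kept, while an entry with trailing punctuation such as Save As... would be dropped.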

Remove any translations with issues

You can, for instance, remove all entries that consist of a single letter, which is useful for eliminating all those spurious accelerator keys.

pogrep --header --search=msgid -v -e "^.$" zulu-short zulu-short-clean

We use the -v option to invert the search. Our cleaner potential glossary words are now in zulu-short-clean. What you can eliminate is limited only by your ability to build regular expressions, but you could eliminate:

  • Entries with only numbers

  • Entries that only contain punctuation
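As a rough illustration of the two cases listed above, here is a sketch of expressions that match number-only and punctuation-only entries. The patterns and the helper name is_noise are assumptions made for this example, not part of the toolkit; adjust them to the conventions used in your own files.

```python
import re

# Hypothetical patterns for the two kinds of entry listed above.
# Digits, optionally grouped with spaces, dots or commas (42, 3.14, 1 000).
ONLY_NUMBERS = re.compile(r"^\d+([\s.,]\d+)*$")
# Nothing but non-word characters and underscores (..., -->, _).
ONLY_PUNCTUATION = re.compile(r"^[\W_]+$")


def is_noise(msgid):
    """Return True for entries made up solely of numbers or punctuation."""
    return bool(ONLY_NUMBERS.search(msgid) or ONLY_PUNCTUATION.search(msgid))


# Hypothetical sample entries.
for text in ["42", "3.14", "...", "File", "10 items"]:
    print(repr(text), "remove" if is_noise(text) else "keep")
```

Once an expression behaves as intended, it could be passed to pogrep with -v -e in the same way as the single-letter filter.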

Create a compendium

Now that we have our words, we want to create a single file containing all the terminology. Thus we create a PO compendium:

~/path/to/pocompendium -i -su zulu-gnome-glossary.po -d zulu-short-clean

You can use various methods, but our bash script is quite good. Here we ignore case (-i) and ignore the underscore (_) accelerator key (-su), writing the results to the output file named on the command line.

We now have a single file containing all glossary terms and the clean up and review can begin.

Split the file

We want to split the file into translated, untranslated and fuzzy entries:

~/path/to/posplit ./zulu-gnome-glossary.po

This will create three files:

  • zulu-gnome-glossary-translated.po – all fully translated entries

  • zulu-gnome-glossary-untranslated.po – messages with no translation

  • zulu-gnome-glossary-fuzzy.po – words that need investigation

rm zulu-gnome-glossary-untranslated.po

We discard zulu-gnome-glossary-untranslated.po since untranslated entries are of no use to us.

Dealing with the fuzzies

The fuzzies come in two kinds: those that are simply wrong or need updating, and those where there was more than one translation for a given term. So if someone had translated ‘File’ differently across the translations, we would have an entry marked fuzzy with the two options displayed.

pofilter -t compendiumconflicts zulu-gnome-glossary-fuzzy.po zulu-gnome-glossary-conflicts.po

These compendium conflicts are what we are interested in, so we use pofilter to separate them from the other fuzzies.

rm zulu-gnome-glossary-fuzzy.po

We discard the other fuzzies as they were probably wrong in the first place. You could review these, but it is not recommended.

Now edit zulu-gnome-glossary-conflicts.po to resolve the conflicts. You can edit them however you like but we usually follow the format:

option1, option2, option3

You can get them into that layout by doing the following:

sed '/#, fuzzy/d; /\"#-#-#-#-# /d; /# (pofilter) compendiumconflicts:/d; s/\\n"$/, "/' zulu-gnome-glossary-conflicts.po > tmp.po
msgcat tmp.po > zulu-gnome-glossary-conflicts.po
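To make the effect of the sed expression easier to follow, here is a line-by-line sketch of the same four rules in Python. The function name flatten_conflicts and the sample entry are hypothetical, and the sample assumes the conflict markers have the "#-#-#-#-#  file  #-#-#-#-#" shape that the sed pattern looks for.

```python
import re


def flatten_conflicts(po_text):
    """Apply the same four rules as the sed expression, one line at a time."""
    kept = []
    for line in po_text.splitlines():
        # /#, fuzzy/d : drop the fuzzy flag line
        if "#, fuzzy" in line:
            continue
        # /\"#-#-#-#-# /d : drop the per-file conflict marker lines
        if '"#-#-#-#-# ' in line:
            continue
        # /# (pofilter) compendiumconflicts:/d : drop the pofilter comment
        if "# (pofilter) compendiumconflicts:" in line:
            continue
        # s/\\n"$/, "/ : replace a trailing \n" with , " to join the options
        kept.append(re.sub(r'\\n"$', ', "', line))
    return "\n".join(kept) + "\n"


# Hypothetical conflict entry; file names and translations are placeholders.
sample = r'''#, fuzzy
msgid "File"
msgstr ""
"#-#-#-#-#  a.po  #-#-#-#-#\n"
"option1\n"
"#-#-#-#-#  b.po  #-#-#-#-#\n"
"option2"
'''

print(flatten_conflicts(sample))
```

The alternatives are left as adjacent string fragments ("option1, " followed by "option2"); running the result through msgcat, as in the command above, should then normalise the layout of the file.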

Of course, if a word is clearly wrong, misspelled, etc., you can eliminate it. Often you will find that the “problem” relates to the part of speech of the source word and that there are indeed two valid options depending on the context.

You now have a cleaned fuzzy file and we are ready to proceed.

Put it back together again

msgcat zulu-gnome-glossary-translated.po zulu-gnome-glossary-conflicts.po > zulu-gnome-glossary.po

We now have a single file zulu-gnome-glossary.po which contains our glossary texts.

Create other formats

It is probably good to make your terminology available in other formats. You can create CSV and TMX files from your PO.

po2csv zulu-gnome-glossary.po zulu-gnome-glossary.csv
po2tmx -l zu zulu-gnome-glossary.po zulu-gnome-glossary.tmx

For the terminology to be usable by Trados or Wordfast translators, it needs to be in one of the following formats:

  • Trados – comma delimited file source,target

  • Wordfast – tab delimited file source[tab]target

In those formats the terminology is available to almost all localisers in the world.

FIXME need scripts to generate these formats.
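Until such scripts exist, the following is a minimal sketch of one possible conversion. It assumes that the CSV file produced by po2csv has the three columns location, source and target, with a header row of those names; check this against your own output before relying on it. The function name csv_to_glossaries is hypothetical, and the exact quoting rules that Trados and Wordfast expect are not verified here.

```python
import csv
import io


def csv_to_glossaries(po2csv_text):
    """Return (comma_delimited, tab_delimited) text with source and target only.

    Assumes each input row is location, source, target (an assumption
    about the po2csv output, not something this sketch verifies).
    """
    pairs = []
    for row in csv.reader(io.StringIO(po2csv_text)):
        if len(row) < 3:
            continue  # malformed or empty line
        source, target = row[1], row[2]
        if row[:3] == ["location", "source", "target"]:
            continue  # header row
        if not source or not target:
            continue  # PO header entry or untranslated item
        pairs.append((source, target))

    comma_buf = io.StringIO()
    csv.writer(comma_buf, lineterminator="\n").writerows(pairs)

    tab_buf = io.StringIO()
    csv.writer(tab_buf, delimiter="\t", lineterminator="\n").writerows(pairs)

    return comma_buf.getvalue(), tab_buf.getvalue()


# Hypothetical input; locations and translations are placeholders.
sample = (
    '"location","source","target"\n'
    '"file.c:10","File","option1"\n'
    '"file.c:20","Save As","option2, option3"\n'
)
comma_text, tab_text = csv_to_glossaries(sample)
print(comma_text)
print(tab_text)
```

To use it on real data, read the contents of zulu-gnome-glossary.csv into a string, pass it to the function, and write the two returned strings to files of your choosing.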

The work has only just begun

The lists you have just created are useful in their own right, but you will most likely want to keep growing, cleaning and improving them.

You should as a first step review what you have created and fix spelling and other errors or disambiguate terms as needed.

But congratulations: a terminology list or glossary is one of your most important assets for creating good and consistent translations, and it acts as a valuable resource for both new and experienced translators when they need prompting on how to translate a term.