Guessing Languages in Transifex

Transifex is all about automation that makes developers’ lives easier. Many guessing functions cooperate closely to give the end user this **essence of magic**. One of them is the guess_language method, which guesses the language of a particular file from its file path (the filename argument it takes as input).

Currently, when guess_language is called, the following code is executed:

    def guess_language(self, filename):
        """Guess a language from the filename."""
        if 'LC_MESSAGES' in filename:
            fp = filename.split('LC_MESSAGES')
            return os.path.basename(fp[0][:-1:])
        else:
            return os.path.basename(filename[:-3:])

  • If the filename contains LC_MESSAGES, it returns the name of the directory just before it (i.e. filename="alanguage/LC_MESSAGES/lala.po" returns "alanguage").
  • Otherwise, it strips the last three characters of the path and the containing directories, and returns what remains. This is OK because the filenames we feed this function always end with ".po".
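
To make the two branches concrete, here is a minimal sketch of the same heuristic written as a standalone function (the original is a method; the sample paths are made up for illustration). It also shows where the heuristic breaks down for a layout such as po/foo.el.po:

```python
import os


def guess_language(filename):
    """Original heuristic, rewritten as a plain function for illustration."""
    if 'LC_MESSAGES' in filename:
        # Keep everything before 'LC_MESSAGES', drop the trailing slash,
        # and take the last path component.
        head = filename.split('LC_MESSAGES')[0]
        return os.path.basename(head[:-1])
    # Otherwise strip the three-character '.po' suffix and the directories.
    return os.path.basename(filename[:-3])


print(guess_language("alanguage/LC_MESSAGES/lala.po"))  # alanguage
print(guess_language("po/el.po"))                       # el
print(guess_language("po/foo.el.po"))                   # foo.el
```

The last call returns "foo.el", which is not a language code; that is the kind of structure the rest of this post tries to handle.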

Some of our users [example], for their own reasons, store their po files differently, so we had to come up with a workaround to make them happier. As stated on the ticket, we considered a couple of approaches to solving this efficiently:

  1. Add a field to the component that holds a regular expression for finding the language. guess_language then becomes totally dumb: it just applies the regular expression given by the user to identify the language of the file (Component Model Implementation, referred to as CMI below).
  2. Instead of passing the problem on to the user, enhance guess_language to recognize more filename structures (GuessLanguageEnhancement, referred to as GLE below).

The first approach would require us to alter the component model by adding one more field, e.g.:

    class Component(models.Model):

        """A component is a translatable resource."""

        slug = models.SlugField(_('Slug'), max_length=30,
            help_text=_('A short label to be used in the URL, containing only '
                        'letters, numbers, underscores or hyphens.'))
        ...
        langname_filter = models.CharField(_('File filter'), max_length=70,
            help_text=_("A regex to find the language ie: "
            "'.+/%(lang)s/LC_MESSAGES/%(fbase)s.po$'"))
        ...

and then, in the guess_language function:

    def guess_language(self, filename, regex):
        m = re.match(regex, filename)
        if m:
            return m.group('lang')
        return None

This would give power users great flexibility: they could define their own regexes to identify strange or uncommon l10n directory structures.

The second approach is to hard-code a number of regexes in the guess_language function; the first one that matches yields the guessed language.

For instance, guess_language would become:
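
The filter shown in the help text contains %(lang)s and %(fbase)s placeholders, so it would have to be expanded into a real regular expression before matching. A minimal sketch of how that might look follows; the PLACEHOLDERS mapping, the standalone function form, and the sample paths are assumptions made for illustration, not code from Transifex:

```python
import re

# Hypothetical expansions for the placeholders a user may put in the filter.
PLACEHOLDERS = {
    'lang': r"(?P<lang>[-_@\w]+)",
    'fbase': r"(?P<fbase>[^/]+)",
}


def guess_language(filename, langname_filter):
    """Apply a per-component filter template to a file path."""
    # Expand the placeholders, then match from the start of the path.
    pattern = langname_filter % PLACEHOLDERS
    m = re.match(pattern, filename)
    if m:
        return m.group('lang')
    return None


component_filter = r".+/%(lang)s/LC_MESSAGES/%(fbase)s\.po$"
print(guess_language("src/locale/el/LC_MESSAGES/django.po", component_filter))  # el
print(guess_language("src/po/el.po", component_filter))                         # None
```

Note that the pattern is rebuilt on every call here, and that a user-supplied filter containing a stray % would make the expansion fail, which hints at how error prone this route can be.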

    def guess_language(self, filename):
        """Guess a language from the filename.

        A number of regexes will be applied, and the first one which matches
        is the winner.
        """
        LANG_RE = r"(?P<lang>\b(?!po\b)[-_@\w]+)"
        FBASE_RE = r"(?P<fbase>[^/]+)" # Anything but slashes.
        RE_LIST = (
            #^el.po Notice the ^, only matching in beginning.
            r"^%(lang)s\.po$",
            #^po/el.po: Notice the ^, only matching in beginning.
            r"^po/%(lang)s\.po$",
            #.../po/el.po: Too generic to avoid the po/ prefix.
            r".+/po/%(lang)s\.po$",
            #.../el/LC_MESSAGES/foo.po
            r".+/%(lang)s/LC_MESSAGES/%(fbase)s\.po$",
            #.../po/el/foo.po
            r".+/%(lang)s/%(fbase)s\.po$",
            #.../po/foo.el.po
            r".+/%(fbase)s\.%(lang)s\.po$",
        )
        # Inject the language regex into the list elements
        RE_LIST = [re.compile(r % {'lang': LANG_RE, 'fbase': FBASE_RE})
                              for r in RE_LIST]
        for regex in RE_LIST:
            m = regex.match(filename)
            if m:
                return m.group('lang')

At first glance this approach looks really slow, since the guess_language method is called for every single po file. A quick trick makes it much faster: we compile the regular expressions at module level, outside the class, so they are built only once. guess_language then becomes:

    import re

    LANG_RE = r"(?P<lang>\b(?!po\b)[-_@\w]+)"
    FBASE_RE = r"(?P<fbase>[^/]+)" # Anything but slashes.
    # List of regex to match. These should include a %(lang)s where the
    # language locale should appear.
    RE_LIST = (
        #^el.po Notice the ^, only matching in beginning.
        r"^%(lang)s\.po$",
        #^po/el.po: Notice the ^, only matching in beginning.
        r"^po/%(lang)s\.po$",
        #.../po/el.po: Too generic to avoid the po/ prefix.
        r".+/po/%(lang)s\.po$",
        #.../el/LC_MESSAGES/foo.po
        r".+/%(lang)s/LC_MESSAGES/%(fbase)s\.po$",
        #.../po/el/foo.po
        r".+/%(lang)s/%(fbase)s\.po$",
        #.../po/foo.el.po
        r".+/%(fbase)s\.%(lang)s\.po$",
    )
    # Inject the language regex into the list elements
    RE_LIST = [re.compile(r % {'lang': LANG_RE, 'fbase': FBASE_RE})
                          for r in RE_LIST]
    ...
    class POTManager(TransManagerMixin):
        ...
        def guess_language(self, filename):
            """Guess a language from the filename.

            A number of regexes will be applied, and the first one which matches
            is the winner.
            """

            for regex in RE_LIST:
                m = regex.match(filename)
                if m:
                    return m.group('lang')
            ...
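
To see which structures the pattern list recognizes, here is a small self-contained sketch that runs it over a few sample paths. The named groups and the escaped dots are written out explicitly as assumptions, guess_language is a plain function rather than a POTManager method, and the paths are invented for the example:

```python
import re

LANG_RE = r"(?P<lang>\b(?!po\b)[-_@\w]+)"
FBASE_RE = r"(?P<fbase>[^/]+)"
# Compiled once, at import time.
RE_LIST = [re.compile(r % {'lang': LANG_RE, 'fbase': FBASE_RE}) for r in (
    r"^%(lang)s\.po$",
    r"^po/%(lang)s\.po$",
    r".+/po/%(lang)s\.po$",
    r".+/%(lang)s/LC_MESSAGES/%(fbase)s\.po$",
    r".+/%(lang)s/%(fbase)s\.po$",
    r".+/%(fbase)s\.%(lang)s\.po$",
)]


def guess_language(filename):
    """Return the language captured by the first matching pattern."""
    for regex in RE_LIST:
        m = regex.match(filename)
        if m:
            return m.group('lang')
    return None


SAMPLES = [
    "el.po",
    "po/el.po",
    "src/po/pt_BR.po",
    "locale/el/LC_MESSAGES/django.po",
    "docs/po/el/manual.po",
    "docs/po/manual.el.po",
    "README.txt",
]
for path in SAMPLES:
    print(path, "->", guess_language(path))
```

The negative lookahead in LANG_RE is what keeps a directory named po from being reported as a language: for "po/el.po" the first pattern is rejected and the second one returns "el".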

This version is about 4 times faster than the one that builds the regular expressions inside the method. For 35k po files (the number of po files living on Transifex.net), the regular-expression-based guess_language is only two times slower than the original implementation, and it is capable of recognizing 8 more l10n file structures.

The question now is how this competes with the CMI. With some crude tests, the results are:

  • For 35k files, the CMI implementation is three times slower than the GLE approach. Although the CMI does less work per guess_language call, it has to compile its regular expression on every call, while the GLE compiles its expressions once and for all when the application starts.
  • For 10k files, the CMI implementation is 2.5 times slower than the GLE approach.

The test inputs were Transifex po file paths, fed to both implementations of guess_language. The regular expressions used in each case were those that GLE supports (for the CMI, one regex was defined per component object).
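
A crude comparison of this kind could be set up along the following lines. This is only a sketch: the paths are synthetic rather than real Transifex paths, a single pattern stands in for the full list, and the function names are hypothetical. Python's re module also keeps an internal cache of compiled patterns, so the gap you measure may differ from the figures above.

```python
import re
import timeit

LANG_RE = r"(?P<lang>[-_@\w]+)"
TEMPLATE = r".+/%(lang)s/LC_MESSAGES/[^/]+\.po$"

# Synthetic sample paths (assumption; the original test used real ones).
PATHS = ["proj%d/locale/el/LC_MESSAGES/app.po" % i for i in range(1000)]

# GLE style: the pattern is compiled once, at import time.
PRECOMPILED = re.compile(TEMPLATE % {'lang': LANG_RE})


def guess_precompiled(filename):
    m = PRECOMPILED.match(filename)
    return m.group('lang') if m else None


def guess_per_call(filename, template=TEMPLATE):
    # CMI style: expand and compile the pattern on every call.
    regex = re.compile(template % {'lang': LANG_RE})
    m = regex.match(filename)
    return m.group('lang') if m else None


def run(func):
    """Apply one implementation to every sample path."""
    return [func(p) for p in PATHS]


if __name__ == "__main__":
    for func in (guess_precompiled, guess_per_call):
        seconds = timeit.timeit(lambda: run(func), number=20)
        print("%s: %.3fs" % (func.__name__, seconds))
```

Both functions return the same guesses; only the place where the pattern is compiled differs, which is the variable the comparison is meant to isolate.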

So the conclusion was that, despite the limitation the GLE approach imposes on the user (only predefined directory structures are supported), its performance merits were too attractive to pass up. Also, while the CMI provides greater flexibility, it would be more confusing for most people: writing regular expressions is usually an error-prone and burdensome task. So we went with the GLE approach for implementing this feature, given its ease of use and its performance.

Support for these more intuitive directory structures **is planned to land on Transifex.net, along with other amazing features, on the 20th of April, when the next release is due** (if all goes well 😉).

Of course, if you have any more ideas about this particular snippet of code, corrections, or suggestions on how to make it faster, don’t hesitate to leave us a comment.
