1 | |
2 | # This document contains text in Perl "POD" format. |
3 | # Use a POD viewer like perldoc or perlman to render it. |
4 | |
5 | =head1 NAME |
6 | |
7 | Locale::Maketext::TPJ13 -- article about software localization |
8 | |
9 | =head1 SYNOPSIS |
10 | |
11 | # This is an article, not a module. |
12 | |
13 | =head1 DESCRIPTION |
14 | |
15 | The following article by Sean M. Burke and Jordan Lachler |
16 | first appeared in I<The Perl |
17 | Journal> #13 and is copyright 1999 The Perl Journal. It appears |
18 | courtesy of Jon Orwant and The Perl Journal. This document may be |
19 | distributed under the same terms as Perl itself. |
20 | |
21 | =head1 Localization and Perl: gettext breaks, Maketext fixes |
22 | |
23 | by Sean M. Burke and Jordan Lachler |
24 | |
25 | This article points out cases where gettext (a common system for |
26 | localizing software interfaces -- i.e., making them work in the user's |
27 | language of choice) fails because of basic differences between human |
28 | languages. This article then describes Maketext, a new system capable |
29 | of correctly treating these differences. |
30 | |
31 | =head2 A Localization Horror Story: It Could Happen To You |
32 | |
33 | =over |
34 | |
35 | "There are a number of languages spoken by human beings in this |
36 | world." |
37 | |
38 | -- Harald Tveit Alvestrand, in RFC 1766, "Tags for the |
39 | Identification of Languages" |
40 | |
41 | =back |
42 | |
43 | Imagine that your task for the day is to localize a piece of software |
44 | -- and luckily for you, the only output the program emits is two |
45 | messages, like this: |
46 | |
47 | I scanned 12 directories. |
48 | |
49 | Your query matched 10 files in 4 directories. |
50 | |
51 | So how hard could that be? You look at the code that |
52 | produces the first item, and it reads: |
53 | |
54 |   printf("I scanned %g directories.", |
55 |          $directory_count); |
56 | |
57 | You think about that, and realize that it doesn't even work right for |
58 | English, as it can produce this output: |
59 | |
60 | I scanned 1 directories. |
61 | |
62 | So you rewrite it to read: |
63 | |
64 |   printf("I scanned %g %s.", |
65 |          $directory_count, |
66 |          $directory_count == 1 ? |
67 |            "directory" : "directories", |
68 |   ); |
69 | |
70 | ...which does the Right Thing. (In case you don't recall, "%g" formats |
71 | a number in compact general notation, and "%s" is for string |
72 | interpolation.) |
73 | |
74 | But you still have to localize it for all the languages you're |
75 | producing this software for, so you pull Locale::gettext off of CPAN |
76 | so you can access the C<gettext> C functions you've heard are standard |
77 | for localization tasks. |
78 | |
79 | And you write: |
80 | |
81 |   printf(gettext("I scanned %g %s."), |
82 |          $dir_scan_count, |
83 |          $dir_scan_count == 1 ? |
84 |            gettext("directory") : gettext("directories"), |
85 |   ); |
86 | |
87 | But you then read in the gettext manual (Drepper, Miller, and Pinard 1995) |
88 | that this is not a good idea, since how a single word like "directory" |
89 | or "directories" is translated may depend on context -- and this is |
90 | true, since in a case language like German or Russian, you may need |
91 | these words with a different case ending in the first instance (where the |
92 | word is the object of a verb) than in the second instance, which you haven't even |
93 | gotten to yet (where the word is the object of a preposition, "in %g |
94 | directories") -- assuming these keep the same syntax when translated |
95 | into those languages. |
96 | |
97 | So, on the advice of the gettext manual, you rewrite: |
98 | |
99 |   printf( $dir_scan_count == 1 ? |
100 |            gettext("I scanned %g directory.") : |
101 |            gettext("I scanned %g directories."), |
102 |          $dir_scan_count ); |
103 | |
104 | So, you email your various translators (the boss decides that the |
105 | languages du jour are Chinese, Arabic, Russian, and Italian, so you |
106 | have one translator for each), asking for translations for "I scanned |
107 | %g directory." and "I scanned %g directories.". When they reply, |
108 | you'll put that in the lexicons for gettext to use when it localizes |
109 | your software, so that when the user is running under the "zh" |
110 | (Chinese) locale, gettext("I scanned %g directory.") will return the |
111 | appropriate Chinese text, with a "%g" in there where printf can then |
112 | interpolate $dir_scan_count. |
113 | |
114 | Your Chinese translator emails right back -- he says both of these |
115 | phrases translate to the same thing in Chinese, because, in linguistic |
116 | jargon, Chinese "doesn't have number as a grammatical category" -- |
117 | whereas English does. That is, English has grammatical rules that |
118 | refer to "number", i.e., whether something is grammatically singular |
119 | or plural; and one of these rules is the one that forces nouns to take |
120 | a plural suffix (generally "s") when in a plural context, as they are when |
121 | they follow a number other than "one" (including, oddly enough, "zero"). |
122 | Chinese has no such rules, and so has just the one phrase where English |
123 | has two. But, no problem, you can have this one Chinese phrase appear |
124 | as the translation for the two English phrases in the "zh" gettext |
125 | lexicon for your program. |
126 | |
127 | Emboldened by this, you dive into the second phrase that your software |
128 | needs to output: "Your query matched 10 files in 4 directories.". You notice |
129 | that if you want to treat phrases as indivisible, as the gettext |
130 | manual wisely advises, you need four cases now, instead of two, to |
131 | cover the permutations of singular and plural on the two items, |
132 | $directory_count and $file_count. So you try this: |
133 | |
134 |   printf( $file_count == 1 ? |
135 |     ( $directory_count == 1 ? |
136 |      gettext("Your query matched %g file in %g directory.") : |
137 |      gettext("Your query matched %g file in %g directories.") ) : |
138 |     ( $directory_count == 1 ? |
139 |      gettext("Your query matched %g files in %g directory.") : |
140 |      gettext("Your query matched %g files in %g directories.") ), |
141 |    $file_count, $directory_count, |
142 |   ); |
143 | |
144 | (The case of "1 file in 2 [or more] directories" could, I suppose, |
145 | occur in the case of symlinking or something of the sort.) |
146 | |
147 | It occurs to you that this is not the prettiest code you've ever |
148 | written, but this seems the way to go. You mail off to the |
149 | translators asking for translations for these four cases. The |
150 | Chinese guy replies with the one phrase that these all translate to in |
151 | Chinese, and that phrase has two "%g"s in it, as it should -- but |
152 | there's a problem. He translates it word-for-word back: "To your |
153 | question, in %g directories you would find %g answers." The "%g" |
154 | slots are in an order reverse to what they are in English. You wonder |
155 | how you'll get gettext to handle that. |
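
The slot-order problem is easy to reproduce. Here is a minimal sketch, using the translator's word-for-word back-translation as a stand-in for the real Chinese text; the variable names are just for illustration. It also shows that Perl's own sprintf accepts an explicit argument index such as C<%2$g>. Whether a given gettext setup would pass such a format through is an assumption here, and in any case it does nothing for the grammatical-number problems.

```perl
use strict;
use warnings;

# The caller always supplies (files, directories), in English order.
my ($file_count, $directory_count) = (10, 4);

# Stand-in for the Chinese lexicon entry: the slots are reversed.
my $translated =
    "To your question, in %g directories you would find %g answers.";

# Plain %g slots are filled left to right, so the counts land in
# the wrong places.
my $wrong = sprintf($translated, $file_count, $directory_count);

# An explicit argument index (%2$g, %1$g) lets the format string pick
# its arguments out of order. Single quotes keep Perl from treating
# "$g" as a variable.
my $reordered =
    'To your question, in %2$g directories you would find %1$g answers.';
my $right = sprintf($reordered, $file_count, $directory_count);

print "$wrong\n$right\n";
```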
156 | |
157 | But you put it aside for the moment, and optimistically hope that the |
158 | other translators won't have this problem, and that their languages |
159 | will be better behaved -- i.e., that they will be just like English. |
160 | |
161 | But the Arabic translator is the next to write back. First off, your |
162 | code for "I scanned %g directory." or "I scanned %g directories." |
163 | assumes there's only singular or plural. But, to use linguistic |
164 | jargon again, Arabic has grammatical number, like English (but unlike |
165 | Chinese), but it's a three-term category: singular, dual, and plural. |
166 | In other words, the way you say "directory" depends on whether there's |
167 | one directory, or I<two> of them, or I<more than two> of them. Your |
168 | test of C<($directory_count == 1)> no longer does the job. And it means |
169 | that where English's grammatical category of number necessitates |
170 | only the two permutations of the first sentence based on "directory |
171 | [singular]" and "directories [plural]", Arabic has three -- and, |
172 | worse, in the second sentence ("Your query matched %g file in %g |
173 | directory."), where English has four, Arabic has nine. You sense |
174 | an unwelcome, exponential trend taking shape. |
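
To put numbers on that trend, here is a small sketch. The helper names are made up for illustration, and the three-way split is only the simplified picture of Arabic given above, not a full account of Arabic number agreement.

```perl
use strict;
use warnings;

# Hypothetical helper: the simplified three-way category described above.
sub number_category {
    my ($n) = @_;
    return $n == 1 ? 'singular'
         : $n == 2 ? 'dual'
         :           'plural';
}

# If every numeric slot can fall into any category independently, a
# phrase-per-combination lexicon needs categories ** slots entries.
sub phrases_needed {
    my ($categories, $slots) = @_;
    return $categories ** $slots;
}

printf "%d directories: %s\n", $_, number_category($_) for 1, 2, 3;
printf "English, two slots: %d phrases\n", phrases_needed(2, 2);
printf "Arabic, two slots:  %d phrases\n", phrases_needed(3, 2);
```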
175 | |
176 | Your Italian translator emails you back and says that "I scanned 0 |
177 | directories" (a possible English output of your program) is stilted, |
178 | and if you think that's fine English, that's your problem, but that |
179 | I<just will not do> in the language of Dante. He insists that where |
180 | $directory_count is 0, your program should produce the Italian text |
181 | for "I I<didn't> scan I<any> directories.". And ditto for "I didn't |
182 | match any files in any directories", although he says the last part |
183 | about "in any directories" should probably just be left off. |
184 | |
185 | You wonder how you'll get gettext to handle this; to accommodate the |
186 | ways Arabic, Chinese, and Italian deal with numbers in just these few |
187 | very simple phrases, you need to write code that will ask gettext for |
188 | different queries depending on whether the numerical values in |
189 | question are 1, 2, more than 2, or in some cases 0, and you still haven't |
190 | figured out the problem with the different word order in Chinese. |
191 | |
192 | Then your Russian translator calls on the phone, to I<personally> tell |
193 | you the bad news about how really unpleasant your life is about to |
194 | become: |
195 | |
196 | Russian, like German or Latin, is an inflectional language; that is, nouns |
197 | and adjectives have to take endings that depend on their case |
198 | (i.e., nominative, accusative, genitive, etc...) -- which is roughly a matter of |
199 | what role they have in syntax of the sentence -- |
200 | as well as on the grammatical gender (i.e., masculine, feminine, neuter) |
201 | and number (i.e., singular or plural) of the noun, as well as on the |
202 | declension class of the noun. But unlike with most other inflected languages, |
203 | putting a number-phrase (like "ten" or "forty-three", or their Arabic |
204 | numeral equivalents) in front of a noun in Russian can change the case and |
205 | number that noun is, and therefore the endings you have to put on it. |
206 | |
207 | He elaborates: In "I scanned %g directories", you'd I<expect> |
208 | "directories" to be in the accusative case (since it is the direct |
209 | object in the sentence) and the plural number, |
210 | except where $directory_count is 1, then you'd expect the singular, of |
211 | course. Just like Latin or German. I<But!> Where $directory_count % |
212 | 10 is 1 ("%" for modulo, remember), assuming $directory_count is an |
213 | integer, and except where $directory_count % 100 is 11, "directories" |
214 | is forced to become grammatically singular, which means it gets the |
215 | ending for the accusative singular... You begin to visualize the code |
216 | it'd take to test for the problem so far, I<and still work for Chinese |
217 | and Arabic and Italian>, and how many gettext items that'd take, but |
218 | he keeps going... But where $directory_count % 10 is 2, 3, or 4 |
219 | (except where $directory_count % 100 is 12, 13, or 14), the word for |
220 | "directories" is forced to be genitive singular -- which means another |
221 | ending... The room begins to spin around you, slowly at first... But |
222 | with I<all other> integer values, since "directory" is an inanimate |
223 | noun, when preceded by a number and in the nominative or accusative |
224 | cases (as it is here, just your luck!), it does stay plural, but it is |
225 | forced into the genitive case -- yet another ending... And |
226 | you never hear him get to the part about how you're going to run into |
227 | similar (but maybe subtly different) problems with other Slavic |
228 | languages like Polish, because the floor comes up to meet you, and you |
229 | fade into unconsciousness. |
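
For readers still conscious, the translator's rule can be written down as a short function. This is only a sketch transcribed from the description above: the function name is invented, it assumes a non-negative integer, and it covers just this one situation (an inanimate noun that the sentence wants in the accusative).

```perl
use strict;
use warnings;

# Hypothetical helper: which case and number the counted noun takes,
# following the rule as the translator states it.
sub russian_noun_form {
    my ($n) = @_;    # assumed to be a non-negative integer
    my ($mod10, $mod100) = ($n % 10, $n % 100);

    # 1, 21, 31, ... but not 11, 111, ...
    return 'accusative singular'
        if $mod10 == 1 && $mod100 != 11;

    # 2-4, 22-24, ... but not 12-14, 112-114, ...
    return 'genitive singular'
        if $mod10 >= 2 && $mod10 <= 4
        && !($mod100 >= 12 && $mod100 <= 14);

    # Everything else, including 0, 5-20, and 25-30.
    return 'genitive plural';
}

printf "%3d => %s\n", $_, russian_noun_form($_) for 1, 2, 5, 11, 21, 112;
```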
230 | |
231 | |
232 | The above cautionary tale relates how an attempt at localization can |
233 | lead from programmer consternation, to program obfuscation, to a need |
234 | for sedation. But careful evaluation shows that your choice of tools |
235 | merely needed further consideration. |
236 | |
237 | =head2 The Linguistic View |
238 | |
239 | =over |
240 | |
241 | "It is more complicated than you think." |
242 | |
243 | -- The Eighth Networking Truth, from RFC 1925 |
244 | |
245 | =back |
246 | |
247 | The field of Linguistics has expended a great deal of effort over the |
248 | past century trying to find grammatical patterns which hold across |
249 | languages; it's been a constant process |
250 | of people making generalizations that should apply to all languages, |
251 | only to find out that, all too often, these generalizations fail -- |
252 | sometimes failing for just a few languages, sometimes whole classes of |
253 | languages, and sometimes nearly every language in the world except |
254 | English. Broad statistical trends are evident in what the "average |
255 | language" is like as far as what its rules can look like, must look |
256 | like, and cannot look like. But the "average language" is just as |
257 | unreal a concept as the "average person" -- it runs up against the |
258 | fact no language (or person) is, in fact, average. The wisdom of past |
259 | experience leads us to believe that any given language can do whatever |
260 | it wants, in any order, with appeal to any kind of grammatical |
261 | categories it wants -- case, number, tense, real or metaphoric |
262 | characteristics of the things that words refer to, arbitrary or |
263 | predictable classifications of words based on what endings or prefixes |
264 | they can take, degree or means of certainty about the truth of |
265 | statements expressed, and so on, ad infinitum. |
266 | |
267 | Mercifully, most localization tasks are a matter of finding ways to |
268 | translate whole phrases, generally sentences, where the context is |
269 | relatively set, and where the only variation in content is I<usually> |
270 | in a number being expressed -- as in the example sentences above. |
271 | Translating specific, fully-formed sentences is, in practice, fairly |
272 | foolproof -- which is good, because that's what's in the phrasebooks |
273 | that so many tourists rely on. Now, a given phrase (whether in a |
274 | phrasebook or in a gettext lexicon) in one language I<might> have a |
275 | greater or lesser applicability than that phrase's translation into |
276 | another language -- for example, strictly speaking, in Arabic, the |
277 | "your" in "Your query matched..." would take a different form |
278 | depending on whether the user is male or female; so the Arabic |
279 | translation "your[feminine] query" is applicable in fewer cases than |
280 | the corresponding English phrase, which doesn't distinguish the user's |
281 | gender. (In practice, it's not feasible to have a program know the |
282 | user's gender, so the masculine "you" in Arabic is usually used, by |
283 | default.) |
284 | |
285 | But in general, such surprises are rare when entire sentences are |
286 | being translated, especially when the functional context is restricted |
287 | to that of a computer interacting with a user either to convey a fact |
288 | or to prompt for a piece of information. So, for purposes of |
289 | localization, translation by phrase (generally by sentence) is both the |
290 | simplest and the least problematic. |
291 | |
292 | =head2 Breaking gettext |
293 | |
294 | =over |
295 | |
296 | "It Has To Work." |
297 | |
298 | -- First Networking Truth, RFC 1925 |
299 | |
300 | =back |
301 | |
302 | Consider that sentences in a tourist phrasebook are of two types: ones |
303 | like "How do I get to the marketplace?" that don't have any blanks to |
304 | fill in, and ones like "How much do these ___ cost?", where there's |
305 | one or more blanks to fill in (and these are usually linked to a |
306 | list of words that you can put in that blank: "fish", "potatoes", |
307 | "tomatoes", etc.) The ones with no blanks are no problem, but the |
308 | fill-in-the-blank ones may not be really straightforward. If it's a |
309 | Swahili phrasebook, for example, the authors probably didn't bother to |
310 | tell you the complicated ways that the verb "cost" changes its |
311 | inflectional prefix depending on the noun you're putting in the blank. |
312 | The trader in the marketplace will still understand what you're saying if |
313 | you say "how much do these potatoes cost?" with the wrong |
314 | inflectional prefix on "cost". After all, I<you> can't speak proper Swahili, |
315 | I<you're> just a tourist. But while tourists can be stupid, computers |
316 | are supposed to be smart; the computer should be able to fill in the |
317 | blank, and still have the results be grammatical. |
318 | |
319 | In other words, a phrasebook entry takes some values as parameters |
320 | (the things that you fill in the blank or blanks), and provides a value |
321 | based on these parameters, where the way you get that final value from |
322 | the given values can, properly speaking, involve an arbitrarily |
323 | complex series of operations. (In the case of Chinese, it'd be not at |
324 | all complex, at least in cases like the examples at the beginning of |
325 | this article; whereas in the case of Russian it'd be a rather complex |
326 | series of operations. And in some languages, the |
327 | complexity could be spread around differently: while the act of |
328 | putting a number-expression in front of a noun phrase might not be |
329 | complex by itself, it may change how you have to, for example, inflect |
330 | a verb elsewhere in the sentence. This is what in syntax is called |
331 | "long-distance dependencies".) |
332 | |
333 | This talk of parameters and arbitrary complexity is just another way |
334 | to say that an entry in a phrasebook is what in a programming language |
335 | would be called a "function". Just so you don't miss it, this is the |
336 | crux of this article: I<A phrase is a function; a phrasebook is a |
337 | bunch of functions.> |
338 | |
339 | The reason that using gettext runs into walls (as in the above |
340 | second-person horror story) is that you're trying to use a string (or |
341 | worse, a choice among a bunch of strings) to do what you really need a |
342 | function for -- which is futile. Performing (s)printf interpolation |
343 | on the strings which you get back from gettext does allow you to do I<some> |
344 | common things passably well... sometimes... sort of; but, to paraphrase |
345 | what some people say about C<csh> script programming, "it fools you |
346 | into thinking you can use it for real things, but you can't, and you |
347 | don't discover this until you've already spent too much time trying, |
348 | and by then it's too late." |
349 | |
350 | =head2 Replacing gettext |
351 | |
352 | So, what needs to replace gettext is a system that supports lexicons |
353 | of functions instead of lexicons of strings. An entry in a lexicon |
354 | from such a system should I<not> look like this: |
355 | |
356 | "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires" |
357 | |
358 | [\xE9 is e-acute in Latin-1. Some pod renderers would |
359 | scream if I used the actual character here. -- SB] |
360 | |
361 | but instead like this, bearing in mind that this is just a first stab: |
362 | |
363 |   sub I_found_X1_files_in_X2_directories { |
364 |     my( $files, $dirs ) = @_[0,1]; |
365 |     $files = sprintf("%g %s", $files, |
366 |       $files == 1 ? 'fichier' : 'fichiers'); |
367 |     $dirs = sprintf("%g %s", $dirs, |
368 |       $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires"); |
369 |     return "J'ai trouv\xE9 $files dans $dirs."; |
370 |   } |
371 | |
372 | Now, there's no particularly obvious way to store anything but strings |
373 | in a gettext lexicon; so it looks like we just have to start over and |
374 | make something better, from scratch. I call my shot at a |
375 | gettext-replacement system "Maketext", or, in CPAN terms, |
376 | Locale::Maketext. |
377 | |
378 | When designing Maketext, I chose to plan its main features in terms of |
379 | "buzzword compliance". And here are the buzzwords: |
380 | |
381 | =head2 Buzzwords: Abstraction and Encapsulation |
382 | |
383 | The complexity of the language you're trying to output a phrase in is |
384 | entirely abstracted inside (and encapsulated within) the Maketext module |
385 | for that interface. When you call: |
386 | |
387 | print $lang->maketext("You have [quant,_1,piece] of new mail.", |
388 | scalar(@messages)); |
389 | |
390 | you don't know (and in fact can't easily find out) whether this will |
391 | involve lots of figuring, as in Russian (if $lang is a handle to the |
392 | Russian module), or relatively little, as in Chinese. That kind of |
393 | abstraction and encapsulation may encourage other pleasant buzzwords |
394 | like modularization and stratification, depending on what design |
395 | decisions you make. |
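
As a concrete illustration, here is a minimal sketch of that call against the Locale::Maketext module as it ships with Perl. The C<MyApp::L10N> namespace is a made-up project name, and the English class uses the module's C<_AUTO> setting so that the key doubles as its own entry; a real project would likely keep each language class in its own file.

```perl
use strict;
use warnings;

# Hypothetical project base class; only Locale::Maketext is real.
package MyApp::L10N {
    use parent 'Locale::Maketext';
}

# English language class. With _AUTO set, a key that is not in the
# lexicon is compiled and used as its own shorthand entry.
package MyApp::L10N::en {
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = ( _AUTO => 1 );
}

package main;

my $lang = MyApp::L10N->get_handle('en')
    or die "no language handle available";

my @messages = ('first', 'second', 'third');
my $text = $lang->maketext("You have [quant,_1,piece] of new mail.",
                           scalar(@messages));
print "$text\n";
```

The calling code never mentions singular or plural; that decision lives behind the handle's C<quant> method.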
396 | |
397 | =head2 Buzzword: Isomorphism |
398 | |
399 | "Isomorphism" means "having the same structure or form"; in discussions |
400 | of program design, the word takes on the special, specific meaning that |
401 | your implementation of a solution to a problem I<has the same |
402 | structure> as, say, an informal verbal description of the solution, or |
403 | maybe of the problem itself. Isomorphism is, all things considered, |
404 | a good thing -- it's what problem-solving (and solution-implementing) |
405 | should look like. |
406 | |
407 | What's wrong with gettext-using code like this... |
408 | |
409 |   printf( $file_count == 1 ? |
410 |     ( $directory_count == 1 ? |
411 |      gettext("Your query matched %g file in %g directory.") : |
412 |      gettext("Your query matched %g file in %g directories.") ) : |
413 |     ( $directory_count == 1 ? |
414 |      gettext("Your query matched %g files in %g directory.") : |
415 |      gettext("Your query matched %g files in %g directories.") ), |
416 |    $file_count, $directory_count, |
417 |   ); |
418 | |
419 | is first off that it's not well abstracted -- these ways of testing |
420 | for grammatical number (as in the expressions like C<foo == 1 ? |
421 | singular_form : plural_form>) should be abstracted to each language |
422 | module, since how you get grammatical number is language-specific. |
423 | |
424 | But second off, it's not isomorphic -- the "solution" (i.e., the |
425 | phrasebook entries) for Chinese maps from these four English phrases to |
426 | the one Chinese phrase that fits for all of them. In other words, the |
427 | informal solution would be "The way to say what you want in Chinese is |
428 | with the one phrase 'For your question, in Y directories you would |
429 | find X files'" -- and so the implemented solution should be, |
430 | isomorphically, just a straightforward way to spit out that one |
431 | phrase, with numerals properly interpolated. It shouldn't have to map |
432 | from the complexity of other languages to the simplicity of this one. |
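
Here is a sketch of what the isomorphic version might look like with the released Locale::Maketext module. The C<MyApp::L10N> classes are hypothetical, and English back-translation stands in for the actual Chinese text. The point is that the "zh" lexicon holds one entry that simply names its parameters in the order it wants them.

```perl
use strict;
use warnings;

package MyApp::L10N {
    use parent 'Locale::Maketext';
}

# English: the key is its own entry (via _AUTO), and quant picks
# between the singular and plural forms.
package MyApp::L10N::en {
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = ( _AUTO => 1 );
}

# "Chinese" (English stand-in text): one phrase, no number
# distinction, parameters used in reverse order.
package MyApp::L10N::zh {
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = (
        "Your query matched [quant,_1,file] in [quant,_2,directory,directories]."
            => "For your question, in [_2] directories you would find [_1] files.",
    );
}

package main;

my $key =
    "Your query matched [quant,_1,file] in [quant,_2,directory,directories].";

my $en = MyApp::L10N->get_handle('en') or die "no en handle";
my $zh = MyApp::L10N->get_handle('zh') or die "no zh handle";

my $en_text = $en->maketext($key, 10, 4);
my $zh_text = $zh->maketext($key, 10, 4);
print "$en_text\n$zh_text\n";
```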
433 | |
434 | =head2 Buzzword: Inheritance |
435 | |
436 | There's a great deal of reuse possible for sharing of phrases between |
437 | modules for related dialects, or for sharing of auxiliary functions |
438 | between related languages. (By "auxiliary functions", I mean |
439 | functions that don't produce phrase-text, but which, say, return an |
440 | answer to "does this number require a plural noun after it?". Such |
441 | auxiliary functions would be used in the internal logic of functions |
442 | that actually do produce phrase-text.) |
443 | |
444 | In the case of sharing phrases, consider that you have an interface |
445 | already localized for American English (probably by having been |
446 | written with that as the native locale, but that's incidental). |
447 | Localizing it for UK English should, in practical terms, be just a |
448 | matter of running it past a British person with the instructions to |
449 | indicate what few phrases would benefit from a change in spelling or |
450 | possibly minor rewording. In that case, you should be able to put in |
451 | the UK English localization module I<only> those phrases that are |
452 | UK-specific, and for all the rest, I<inherit> from the American |
453 | English module. (And I expect this same situation would apply with |
454 | Brazilian and Continental Portuguese, possibly with some I<very> |
455 | closely related languages like Czech and Slovak, and possibly with the |
456 | slightly different "versions" of written Mandarin Chinese, as I hear exist in |
457 | Taiwan and mainland China.) |
458 | |
459 | As to sharing of auxiliary functions, consider the problem of Russian |
460 | numbers from the beginning of this article; obviously, you'd want to |
461 | write only once the hairy code that, given a numeric value, would |
462 | return some specification of which case and number a given quantified |
463 | noun should use. But suppose that you discover, while localizing an |
464 | interface for, say, Ukrainian (a Slavic language related to Russian, |
465 | spoken by several million people, many of whom would be relieved to |
466 | find that your Web site's or software's interface is available in |
467 | their language), that the rules in Ukrainian are the same as in Russian |
468 | for quantification, and probably for many other grammatical functions. |
469 | While there may well be no phrases in common between Russian and |
470 | Ukrainian, you could still choose to have the Ukrainian module inherit |
471 | from the Russian module, just for the sake of inheriting all the |
472 | various grammatical methods. Or, probably better organizationally, |
473 | you could move those functions to a module called C<_E_Slavic> or |
474 | something, which Russian and Ukrainian could inherit useful functions |
475 | from, but which would (presumably) provide no lexicon. |
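
The phrase-sharing case can be sketched with the released Locale::Maketext module, which looks up entries through a language class's inheritance chain. The class names and the two phrases are invented for the example.

```perl
use strict;
use warnings;

package MyApp::L10N {
    use parent 'Locale::Maketext';
}

# American English holds the full lexicon.
package MyApp::L10N::en_us {
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = (
        'Pick a color.' => 'Pick a color.',
        'Your search expression was malformed.'
            => 'Your search expression was malformed.',
    );
}

# UK English lists only what differs and inherits the rest.
package MyApp::L10N::en_gb {
    use parent -norequire, 'MyApp::L10N::en_us';
    our %Lexicon = (
        'Pick a color.' => 'Pick a colour.',
    );
}

package main;

# The language tag "en-GB" maps to the class MyApp::L10N::en_gb.
my $gb = MyApp::L10N->get_handle('en-GB') or die "no en-GB handle";

my $overridden = $gb->maketext('Pick a color.');
my $inherited  = $gb->maketext('Your search expression was malformed.');
print "$overridden\n$inherited\n";
```

Sharing auxiliary methods works the same way: a grammatical method defined in a parent class is found by ordinary Perl method resolution.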
476 | |
477 | =head2 Buzzword: Concision |
478 | |
479 | Okay, concision isn't a buzzword. But it should be, so I decree that |
480 | as a new buzzword, "concision" means that simple common things should |
481 | be expressible in very few lines (or maybe even just a few characters) |
482 | of code -- call it a special case of "making simple things easy and |
483 | hard things possible", and see also the role it played in the |
484 | MIDI::Simple language, discussed elsewhere in this issue [TPJ#13]. |
485 | |
486 | Consider our first stab at an entry in our "phrasebook of functions": |
487 | |
488 |   sub I_found_X1_files_in_X2_directories { |
489 |     my( $files, $dirs ) = @_[0,1]; |
490 |     $files = sprintf("%g %s", $files, |
491 |       $files == 1 ? 'fichier' : 'fichiers'); |
492 |     $dirs = sprintf("%g %s", $dirs, |
493 |       $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires"); |
494 |     return "J'ai trouv\xE9 $files dans $dirs."; |
495 |   } |
496 | |
497 | You may sense that a lexicon (to use a non-committal catch-all term for a |
498 | collection of things you know how to say, regardless of whether they're |
499 | phrases or words) consisting of functions I<expressed> as above would |
500 | make for rather long-winded and repetitive code -- even if you wisely |
501 | rewrote this to have quantification (as we call adding a number |
502 | expression to a noun phrase) be a function called like: |
503 | |
504 |   sub I_found_X1_files_in_X2_directories { |
505 |     my( $files, $dirs ) = @_[0,1]; |
506 |     $files = quant($files, "fichier"); |
507 |     $dirs = quant($dirs, "r\xE9pertoire"); |
508 |     return "J'ai trouv\xE9 $files dans $dirs."; |
509 |   } |
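
The C<quant> function called there is not shown, so here is one way it might be written, as a sketch only. It keeps the same C<== 1> test as the first stab, and the add-an-"s" default with an optional explicit plural is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical quant helper: number plus correctly numbered noun.
sub quant {
    my ($count, $singular, $plural) = @_;
    # Fall back to adding "s" when no explicit plural is supplied.
    $plural = $singular . 's' unless defined $plural;
    return sprintf("%g %s", $count, $count == 1 ? $singular : $plural);
}

sub I_found_X1_files_in_X2_directories {
    my ($files, $dirs) = @_[0, 1];
    $files = quant($files, "fichier");
    $dirs  = quant($dirs,  "r\xE9pertoire");
    return "J'ai trouv\xE9 $files dans $dirs.";
}

# Show the quantified noun phrases on their own (ASCII-only output).
print quant(1, "fichier"), "\n";
print quant(3, "fichier"), "\n";
```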
510 | |
511 | And you may also sense that you do not want to bother your translators |
512 | with having to write Perl code -- you'd much rather that they spend |
513 | their I<very costly time> on just translation. And this is to say |
514 | nothing of the near impossibility of finding a commercial translator |
515 | who would know even simple Perl. |
516 | |
517 | In a first-hack implementation of Maketext, each language-module's |
518 | lexicon looked like this: |
519 | |
520 |   %Lexicon = ( |
521 |     "I found %g files in %g directories" |
522 |     => sub { |
523 |       my( $files, $dirs ) = @_[0,1]; |
524 |       $files = quant($files, "fichier"); |
525 |       $dirs = quant($dirs, "r\xE9pertoire"); |
526 |       return "J'ai trouv\xE9 $files dans $dirs."; |
527 |     }, |
528 |     ... and so on with other phrase => sub mappings ... |
529 |   ); |
530 | |
531 | but I immediately went looking for some more concise way to basically |
532 | denote the same phrase-function -- a way that would also serve to |
533 | concisely denote I<most> phrase-functions in the lexicon for I<most> |
534 | languages. After much time and even some actual thought, I decided on |
535 | this system: |
536 | |
537 | * Where a value in a %Lexicon hash is a contentful string instead of |
538 | an anonymous sub (or, conceivably, a coderef), it would be interpreted |
539 | as a sort of shorthand expression of what the sub does. When accessed |
540 | for the first time in a session, it is parsed, turned into Perl code, |
541 | and then eval'd into an anonymous sub; then that sub replaces the |
542 | original string in that lexicon. (That way, the work of parsing and |
543 | evaling the shorthand form for a given phrase is done no more than |
544 | once per session.) |
545 | |
546 | * Calls to C<maketext> (as Maketext's main function is called) happen |
547 | thru a "language session handle", notionally very much like an IO |
548 | handle, in that you open one at the start of the session, and use it |
549 | for "sending signals" to an object in order to have it return the text |
550 | you want. |
551 | |
552 | So, this: |
553 | |
554 | $lang->maketext("You have [quant,_1,piece] of new mail.", |
555 | scalar(@messages)); |
556 | |
557 | basically means this: look in the lexicon for $lang (which may inherit |
558 | from any number of other lexicons), and find the function that we |
559 | happen to associate with the string "You have [quant,_1,piece] of new |
560 | mail" (which is, and should be, a functioning "shorthand" for this |
561 | function in the native locale -- English in this case). If you find |
562 | such a function, call it with $lang as its first parameter (as if it |
563 | were a method), and then a copy of scalar(@messages) as its second, |
564 | and then return that value. If that function was found, but was in |
565 | string shorthand instead of being a fully specified function, parse it |
566 | and make it into a function before calling it the first time. |
567 | |
568 | * The shorthand uses code in brackets to indicate method calls that |
569 | should be performed. A full explanation is not in order here, but a |
570 | few examples will suffice: |
571 | |
572 | "You have [quant,_1,piece] of new mail." |
573 | |
574 | The above code is shorthand for, and will be interpreted as, |
575 | this: |
576 | |
577 |   sub { |
578 |     my $handle = $_[0]; |
579 |     my(@params) = @_; |
580 |     return join '', |
581 |       "You have ", |
582 |       $handle->quant($params[1], 'piece'), |
583 |       " of new mail."; |
584 |   } |
585 | |
586 | where "quant" is the name of a method you're using to quantify the |
587 | noun "piece" with the number $params[1]. |
588 | |
589 | A string with no brackety calls, like this: |
590 | |
591 | "Your search expression was malformed." |
592 | |
593 | is somewhat of a degenerate case, and just gets turned into: |
594 | |
595 | sub { return "Your search expression was malformed." } |
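
The compile-once-then-cache behaviour described above can be mimicked with a toy. This is not how Locale::Maketext itself parses the notation (the real parser handles escapes and much more); C<compile_shorthand>, C<toy_maketext>, and C<Toy::Handle> are invented names, and the sketch builds a closure directly instead of generating and eval'ing Perl source.

```perl
use strict;
use warnings;

# Hypothetical handle class with a naive English quant method.
package Toy::Handle {
    sub new   { return bless {}, shift }
    sub quant {
        my ($self, $n, $word) = @_;
        return $n == 1 ? "$n $word" : "$n ${word}s";
    }
}

package main;

# Turn "text [method,_1,arg] text" into a closure.
sub compile_shorthand {
    my ($text) = @_;
    my @parts;
    # The capturing group makes split return the bracket groups too.
    for my $chunk (split /(\[[^\]]*\])/, $text) {
        next if $chunk eq '';
        if ($chunk =~ /^\[(.*)\]$/) {
            my ($method, @args) = split /,/, $1;
            push @parts, sub {
                my ($handle, @params) = @_;
                # _1, _2, ... refer to the caller's parameters.
                my @real = map { /^_(\d+)$/ ? $params[$1 - 1] : $_ } @args;
                return $handle->$method(@real);
            };
        }
        else {
            my $literal = $chunk;
            push @parts, sub { return $literal };
        }
    }
    return sub { my @in = @_; return join '', map { $_->(@in) } @parts };
}

my %Lexicon = (
    "You have [quant,_1,piece] of new mail."
        => "You have [quant,_1,piece] of new mail.",
);

sub toy_maketext {
    my ($handle, $key, @params) = @_;
    my $entry = $Lexicon{$key};
    die "no entry for '$key'" unless defined $entry;
    # First use: compile the string and store the sub in its place.
    $entry = $Lexicon{$key} = compile_shorthand($entry) unless ref $entry;
    return $entry->($handle, @params);
}

print toy_maketext(Toy::Handle->new,
                   "You have [quant,_1,piece] of new mail.", 3), "\n";
```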
596 | |
597 | However, not everything you can write in Perl code can be written in |
598 | the above shorthand system -- not by a long shot. For example, consider |
599 | the Italian translator from the beginning of this article, who wanted |
600 | the Italian for "I didn't find any files" as a special case, instead |
601 | of "I found 0 files". That couldn't be specified (at least not easily |
602 | or simply) in our shorthand system, and it would have to be written |
603 | out in full, like this: |
604 | |
605 | sub { # pretend the English strings are in Italian |
606 | my($handle, $files, $dirs) = @_[0,1,2]; |
607 | return "I didn't find any files" unless $files; |
608 | return join '', |
609 | "I found ", |
610 | $handle->quant($files, 'file'), |
611 | " in ", |
612 | $handle->quant($dirs, 'directory'), |
613 | "."; |
614 | } |
615 | |
616 | Next to a lexicon full of shorthand code, that sort of sticks out like a |
617 | sore thumb -- but this I<is> a special case, after all; and at least |
618 | it's possible, if not as concise as usual. |
619 | |
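To show how such a written-out entry can sit beside ordinary shorthand, here is a small sketch using Locale::Maketext as it is distributed with Perl. The package names (BogoQuery::Locale and an "it" subclass) and the lexicon keys are assumptions for illustration, and English again stands in for Italian.

```perl
use strict;
use warnings;

package BogoQuery::Locale;            # hypothetical project base class
use parent 'Locale::Maketext';

package BogoQuery::Locale::it;        # stand-in "Italian" lexicon
use parent -norequire, 'BogoQuery::Locale';

our %Lexicon = (
    # ordinary entry, in bracket shorthand
    'You have [quant,_1,piece] of new mail.' =>
        'You have [quant,_1,piece] of new mail.',

    # special case, written out in full
    'I found [quant,_1,file] in [quant,_2,directory].' => sub {
        my ($handle, $files, $dirs) = @_;
        return "I didn't find any files" unless $files;
        return join '',
            'I found ',
            $handle->quant($files, 'file'),
            ' in ',
            $handle->quant($dirs, 'directory', 'directories'),
            '.';
    },
);

package main;

my $lh = BogoQuery::Locale->get_handle('it')
    or die "Couldn't make a language handle";

my $key = 'I found [quant,_1,file] in [quant,_2,directory].';
print $lh->maketext($key, 0, 0),  "\n";   # I didn't find any files
print $lh->maketext($key, 10, 4), "\n";   # I found 10 files in 4 directories.
```

The caller cannot tell which kind of entry it got back; both are reached through the same maketext call.
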
620 | As to how you'd implement the Russian example from the beginning of |
621 | the article, well, There's More Than One Way To Do It, but it could be |
622 | something like this (using English words for Russian, just so you know |
623 | what's going on): |
624 | |
625 | "I [quant,_1,directory,accusative] scanned." |
626 | |
627 | This shifts the burden of complexity off to the quant method. That |
628 | method's parameters are: the numeric value it's going to use to |
629 | quantify something; the Russian word it's going to quantify; and the |
630 | parameter "accusative", which you're using to mean that this |
631 | sentence's syntax wants a noun in the accusative case there, although |
632 | that quantification method may have to overrule, for grammatical |
633 | reasons you may recall from the beginning of this article. |
634 | |
635 | Now, the Russian quant method here is responsible not only for |
636 | implementing the strange logic necessary for figuring out how Russian |
637 | number-phrases impose case and number on their noun-phrases, but also |
638 | for inflecting the Russian word for "directory". How that inflection |
639 | is to be carried out is no small issue, and among the solutions I've |
640 | seen, some (like variations on a simple lookup in a hash where all |
641 | possible forms are provided for all necessary words) are |
642 | straightforward but I<can> become cumbersome when you need to inflect |
643 | more than a few dozen words; and other solutions (like using |
644 | algorithms to model the inflections, storing only root forms and |
645 | irregularities) I<can> involve more overhead than is justifiable for |
646 | all but the largest lexicons. |
647 | |
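A sketch of the lookup-table flavor of such a quant method might look like the following. Everything here is an assumption made for illustration: the Sketch::Russian class, the %Forms table, and the tagged English strings that stand in for real Russian forms. The number rule is the simplified one recalled from the beginning of this article, and it ignores further wrinkles of the language.

```perl
use strict;
use warnings;

package Sketch::Russian;   # hypothetical handle class, not part of Maketext

sub new { my $class = shift; return bless {}, $class }

# Lookup-table approach: every needed form of every needed word.
my %Forms = (
    directory => {
        'accusative singular' => 'directory-ACC.SG',
        'genitive singular'   => 'directory-GEN.SG',
        'genitive plural'     => 'directory-GEN.PL',
    },
);

sub quant {
    my ($handle, $num, $word, $case) = @_;
    my $last_two = abs($num) % 100;
    my $last_one = $last_two % 10;
    my $form;
    if ($last_two >= 11 && $last_two <= 14) {
        $form = 'genitive plural';      # 11-14 are exceptions to the rules below
    }
    elsif ($last_one == 1) {
        $form = "$case singular";       # the sentence's own case survives
    }
    elsif ($last_one >= 2 && $last_one <= 4) {
        $form = 'genitive singular';    # the number overrules the sentence
    }
    else {
        $form = 'genitive plural';      # 0 and 5-9
    }
    my $inflected = $Forms{$word}{$form}
        or die "No '$form' form recorded for '$word'\n";
    return "$num $inflected";
}

package main;

my $ru = Sketch::Russian->new;
print $ru->quant($_, 'directory', 'accusative'), "\n" for 1, 3, 5, 11, 21;
```

The table grows by one row per word and one column per form, which is exactly the cost that becomes cumbersome past a few dozen words.
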
648 | Mercifully, this design decision becomes crucial only in the hairiest |
649 | of inflected languages, of which Russian is by no means the I<worst> case |
650 | scenario, but is worse than most. Most languages have simpler |
651 | inflection systems; for example, in English or Swahili, there are |
652 | generally no more than two possible inflected forms for a given noun |
653 | ("error/errors"; "kosa/makosa"), and the |
654 | rules for producing these forms are fairly simple -- or at least, |
655 | simple rules can be formulated that work for most words, and you can |
656 | then treat the exceptions as just "irregular", at least relative to |
657 | your ad hoc rules. A simpler inflection system (simpler rules, fewer |
658 | forms) means that design decisions are less crucial to maintaining |
659 | sanity, whereas the same decisions could incur |
660 | overhead-versus-scalability problems in languages like Russian. It |
661 | is I<also> quite possible that code (possibly in Perl, as with |
662 | Lingua::EN::Inflect, for English nouns) has already |
663 | been written for the language in question, whether simple or complex. |
664 | |
665 | Moreover, a third possibility may even be simpler than anything |
666 | discussed above: just require that all possible (or at least |
667 | applicable) forms be provided in the call to the given language's quant |
668 | method, as in: |
669 | |
670 | "I found [quant,_1,file,files]." |
671 | |
672 | That way, quant just has to choose which form it needs, without having |
673 | to look up or generate anything. While possibly not optimal for |
674 | Russian, this should work well for most other languages, where |
675 | quantification is not as complicated an operation. |
676 | |
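The forms-in-the-call approach can be tried out with the quant method that ships with Locale::Maketext. This is only a minimal sketch; the FormsDemo::L10N package names are hypothetical, and the _AUTO lexicon entry is used so that the key serves as its own shorthand.

```perl
use strict;
use warnings;

package FormsDemo::L10N;              # hypothetical project base class
use parent 'Locale::Maketext';

package FormsDemo::L10N::en;
use parent -norequire, 'FormsDemo::L10N';
our %Lexicon = ( _AUTO => 1 );        # unknown keys act as their own shorthand

package main;

my $lh = FormsDemo::L10N->get_handle('en')
    or die "No language handle";

for my $count (1, 2) {
    print $lh->maketext('I found [quant,_1,file,files].', $count), "\n";
}
# prints "I found 1 file." and then "I found 2 files."
```

Because both forms arrive as arguments, this quant needs no dictionary and no inflection rules; a language with more forms would simply take more arguments.
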
677 | =head2 The Devil in the Details |
678 | |
679 | There's plenty more to Maketext than described above -- for example, |
ff5ad48a |
680 | there's the details of how language tags ("en-US", "i-pwn", "fi", |
9378c581 |
681 | etc.) or locale IDs ("en_US") interact with actual module naming |
682 | ("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there's the |
683 | details of how to record (and possibly negotiate) what character |
684 | encoding Maketext will return text in (UTF8? Latin-1? KOI8?). There's |
685 | the interesting fact that Maketext is for localization, but does not |
686 | actually have a "C<use locale;>" anywhere in it. For the curious, |
687 | there's the somewhat frightening details of how I actually |
688 | implement something like data inheritance so that searches across |
689 | modules' %Lexicon hashes can parallel how Perl implements method |
690 | inheritance. |
691 | |
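Two of these details can be glimpsed in a few lines: how a language tag ends up as a class name, and how a lookup falls through from one %Lexicon to its parent's. The sketch below uses Locale::Maketext as distributed; the TagDemo::L10N package names and the lexicon entries are assumptions made up for the example.

```perl
use strict;
use warnings;

package TagDemo::L10N;                       # hypothetical project base class
use parent 'Locale::Maketext';

package TagDemo::L10N::en;                   # general English
use parent -norequire, 'TagDemo::L10N';
our %Lexicon = (
    'Hello'  => 'Hello',
    'colour' => 'colour',
);

package TagDemo::L10N::en_us;                # U.S. English: only the differences
use parent -norequire, 'TagDemo::L10N::en';
our %Lexicon = (
    'colour' => 'color',
);

package main;

my $us = TagDemo::L10N->get_handle('en-US') or die "No handle for en-US";
my $gb = TagDemo::L10N->get_handle('en-GB') or die "No handle for en-GB";

print ref($us), "\n";                  # TagDemo::L10N::en_us
print ref($gb), "\n";                  # TagDemo::L10N::en  (there is no en_gb class)
print $us->maketext('colour'), "\n";   # color
print $us->maketext('Hello'),  "\n";   # Hello  (found in the parent's lexicon)
```

The en_us class lists only what differs from en, and everything else is found by walking the same @ISA chain that method calls use.
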
692 | And, most importantly, there's all the practical details of how to |
693 | actually go about deriving from Maketext so you can use it for your |
694 | interfaces, and the various tools and conventions for starting out and |
695 | maintaining individual language modules. |
696 | |
697 | That is all covered in the documentation for Locale::Maketext and the |
698 | modules that come with it, available on CPAN. After having read this |
699 | article, which covers the why's of Maketext, the documentation, |
700 | which covers the how's of it, should be quite straightforward. |
701 | |
702 | =head2 The Proof in the Pudding: Localizing Web Sites |
703 | |
704 | Maketext and gettext have a notable difference: gettext is in C, |
705 | accessible through C library calls, whereas Maketext is in Perl, and |
706 | really can't work without a Perl interpreter (although I suppose |
707 | something like it could be written for C). Accidents of history (and |
708 | not necessarily lucky ones) have made C++ the most common language for |
709 | the implementation of applications like word processors, Web browsers, |
710 | and even many in-house applications like custom query systems. Current |
711 | conditions make it somewhat unlikely that the next one of any of these |
712 | kinds of applications will be written in Perl, albeit clearly more for |
713 | reasons of custom and inertia than out of consideration of what is the |
714 | right tool for the job. |
715 | |
716 | However, other accidents of history have made Perl a well-accepted |
717 | language for design of server-side programs (generally in CGI form) |
718 | for Web site interfaces. Localization of static pages in Web sites is |
719 | trivial, feasible either with simple language-negotiation features in |
720 | servers like Apache, or with some kind of server-side inclusions of |
721 | language-appropriate text into layout templates. However, I think |
722 | that the localization of Perl-based search systems (or other kinds of |
723 | dynamic content) in Web sites, be they public or access-restricted, |
724 | is where Maketext will see the greatest use. |
725 | |
726 | I presume that it would be only the exceptional Web site that gets |
727 | localized for English I<and> Chinese I<and> Italian I<and> Arabic |
728 | I<and> Russian, to recall the languages from the beginning of this |
729 | article -- to say nothing of German, Spanish, French, Japanese, |
730 | Finnish, and Hindi, to name a few languages that benefit from large |
731 | numbers of programmers or Web viewers or both. |
732 | |
733 | However, the ever-increasing internationalization of the Web (whether |
734 | measured in terms of amount of content, of numbers of content writers |
735 | or programmers, or of size of content audiences) makes it increasingly |
736 | likely that the interface to the average Web-based dynamic content |
737 | service will be localized for two or maybe three languages. It is my |
738 | hope that Maketext will make that task as simple as possible, and will |
739 | remove previous barriers to localization for languages dissimilar to |
740 | English. |
741 | |
742 | __END__ |
743 | |
744 | Sean M. Burke (sburkeE<64>cpan.org) has a Master's in linguistics |
745 | from Northwestern University; he specializes in language technology. |
746 | Jordan Lachler (lachlerE<64>unm.edu) is a PhD student in the Department of |
747 | Linguistics at the University of New Mexico; he specializes in |
748 | morphology and pedagogy of North American native languages. |
749 | |
750 | =head2 References |
751 | |
752 | Alvestrand, Harald Tveit. 1995. I<RFC 1766: Tags for the |
753 | Identification of Languages.> |
754 | C<ftp://ftp.isi.edu/in-notes/rfc1766.txt> |
755 | [Now see RFC 3066.] |
756 | |
757 | Callon, Ross, editor. 1996. I<RFC 1925: The Twelve |
758 | Networking Truths.> |
759 | C<ftp://ftp.isi.edu/in-notes/rfc1925.txt> |
760 | |
761 | Drepper, Ulrich, Peter Miller, |
762 | and FranE<ccedil>ois Pinard. 1995-2001. GNU |
763 | C<gettext>. Available in C<ftp://prep.ai.mit.edu/pub/gnu/>, with |
764 | extensive docs in the distribution tarball. [Since |
765 | I wrote this article in 1998, I now see that the |
766 | gettext docs are now trying more to come to terms with |
767 | plurality. Whether useful conclusions have come from it |
768 | is another question altogether. -- SMB, May 2001] |
769 | |
770 | Forbes, Nevill. 1964. I<Russian Grammar.> Third Edition, revised |
771 | by J. C. Dumbreck. Oxford University Press. |
772 | |
773 | =cut |
774 | |
775 | #End |
776 | |