Let’s get this straight. I’m a big machine translation fan. Many of my colleagues think machine translation (or MT, as they call it) is the devil. They worry that MT might be after their jobs, or that MT might be giving a bad name to the general concept of translation. As a specialist, I see MT the way that a litigation attorney might see a website that will write your Last Will and Testament for $9.95 — it’s in the same general field, but it’s hardly a threat to my business.
If anything, in the world of patent translation, MT has increased the demand for human translation. Because of MT, foreign documents that nobody would have looked at in the past are being read. Some of these documents turn out to be significant. The significant ones get sent to people like us for a real translation.
You can imagine, then, that I was pretty pumped to hear that Google, the most beloved resource of all translators, is offering something new in the MT world. The truth is that Google has been offering machine translation for some time, but they recently revamped their systems in a way that has been causing a stir.
The new system works with a modified version of the currently fashionable statistical machine translation (SMT) method. Although I worked in MT development in the late 80s, and Patent Translations Inc. even offered a machine translation service for a while, I don’t have much expertise with SMT. I guess, as a child of the 80s, I still think of MT in terms of rules. I did, however, see a presentation on the state of the art of SMT at the 2006 ATA Translation Companies Division Conference in New Jersey. I’ve got to say, I wasn’t impressed.
SMT sounds great on paper. It works by looking at how human translators have translated things in the past and applying what it finds to the new translations it is asked to perform. The problem is that, because rules are subordinate to statistical trends, it tends to forget about the grammar that bound the words in the original sentence together. Worse, some words can be left out of the translation altogether, just because they were not used in the translation corpus that the program is basing its decisions on. The translated sentences often appear to make sense, but when they are compared to the source, it is clear that the original meaning was very different.
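To make that failure mode concrete, here is a toy sketch of the idea in Python. It is not how Google's system (or any real SMT engine) is built. The phrase table, its counts, and the function name `toy_translate` are all invented for illustration. It simply picks the most frequent past rendering of each phrase it recognizes and silently skips anything it has never seen.

```python
from collections import Counter

# Hypothetical "corpus statistics": how often human translators rendered
# each French phrase a given way. Every count here is invented.
PHRASE_TABLE = {
    "une nouvelle": Counter({"a new": 9, "another": 3}),
    "manifestation": Counter({"event": 6, "demonstration": 4}),
    "doit avoir lieu": Counter({"is scheduled to take place": 5,
                                "must take place": 2}),
    "mardi": Counter({"tuesday": 8}),
}


def toy_translate(source, table=PHRASE_TABLE, max_len=3):
    """Greedy longest-match lookup; returns (translation, dropped_words)."""
    words = source.lower().split()
    output, dropped = [], []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in table:
                # Take the statistically most frequent rendering.
                output.append(table[phrase].most_common(1)[0][0])
                i += n
                break
        else:
            # Nothing in the corpus matches: the word simply vanishes.
            dropped.append(words[i])
            i += 1
    return " ".join(output), dropped


text, lost = toy_translate(
    "Une nouvelle manifestation lycéenne doit avoir lieu mardi")
print(text)  # fluent-looking output
print(lost)  # words that never made it into the translation
```

The output reads smoothly, yet "lycéenne" is gone without a trace, and the choice between "event" and "demonstration" was made by a frequency count rather than by anything in the sentence itself.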
When I heard about Google’s new system, I gave it a French patent as a test. French is usually the best language for MT, and patents are fairly MT-friendly. At first pass I was really impressed. There were some wacky bits, of course, but the overall readability was very good for MT, and the system had done what looked like an outstanding job of translating multi-word technical terminology, something that rule-based systems are notoriously poor at. Within a few minutes, however, my happy surprise had turned to dismay. Upon closer examination, I saw that identical technical terms were being parsed and translated in radically different ways at different places in the document. Completely extraneous material, which had not been so much as suggested in the source, was to be found here and there throughout the translation. And time and time again, important lexemes were missing. Not much headway has been made since the presentation I saw in 2006.
I’m not going to post the patent (it’s too long, for one thing), but we can get an idea of both the advantages and disadvantages of the system by taking a look at how it handles the front page of Libé (a popular French paper). Google translates the first headline as, “The students’ struggle to save their parade profs.” One is left wondering what a “parade prof” is but, as a liberal arts major, it’s not too hard for me to imagine, and the sentence flows well enough, so I’m likely to be willing to go with the flow and hope that I’m getting the gist. Unfortunately, the original French was, “Des lycéens «en lutte» défilent pour sauver leurs profs.” Systran’s rule-based MT translates the same sentence as, “High-school pupils ‘in fight’ ravel to save their Profs.” While it is clearly more common to see students unravel than ravel, at least the original grammatical meaning of the sentence is more or less preserved. And if we don’t know what the computer means by “ravel,” at least we know that we don’t know. (We could even go and ask a real translator, who would translate “défiler” in this context as “march.”)
The next bit of text, the lead, shows Google Translate in a better light. The source is, “Plusieurs milliers de lycéens se sont rassemblés à Paris cet après-midi pour protester contre les suppressions de postes d’enseignants. Une nouvelle manifestation doit avoir lieu mardi.” Systran (rule-based) gives this as, “Several thousands of high-school pupils gathered in Paris this afternoon to protest against the removals of posts of teachers. A new demonstration must take place Tuesday.” While the meaning is clear enough, it is definitely unpleasant to read. The readability is much better with Google, which gives it as, “Several thousand students gathered in Paris this afternoon to protest against the abolition of posts of teachers. A new event is scheduled to take place Tuesday.”
The last sentence is particularly impressive. It’s smooth. It’s slick. It conveys the gist of the source text. It even sounds like it was written by a native speaker of English. There is only one problem: that’s not what the source text said. There was no specific mention of scheduling, and the thing that was to take place was not a generic event — it was very specifically a demonstration.
Some readers may think that I am splitting hairs, and I would be the first to admit it. I wrote an entire chapter on hair splitting for the ATA Patent Translator’s Handbook. That is because patent practice — the prosecution and litigation of patents — is all about splitting hairs. You get whole teams of lawyers working through the night on arguments such as, “You said circle, but this is an oval,” or “Moving something is totally different from transporting something.”
That makes Google Translate woefully inappropriate for patent attorneys. When reviewing prior art (which is the only thing that patent attorneys use MT for), what an attorney wants to know is whether certain technical ideas have been described. To do that, they need to know if specific elements have been mentioned. Google Translate, while producing relatively readable output, adds (“scheduled”) and removes (“demonstration”) specific elements, according to the whims of its statistical heart. And that just won’t do.
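For readers who like to see the hair splitting spelled out, the kind of check I am doing by eye can be roughed out in a few lines of Python. This is only a sketch under loud assumptions: the function name `check_elements`, the tiny glossary, and the list of "watch words" are all hypothetical, and plain substring matching is far cruder than what a human reviewer does. It flags source elements with no acceptable rendering in the translation, and target words that nothing in the source seems to justify.

```python
def check_elements(source, translation, glossary, watch_words):
    """Flag source elements that vanished and target elements that appeared.

    glossary:    source term -> acceptable target renderings (hypothetical)
    watch_words: target word -> source cues that would justify it (hypothetical)
    """
    src, tgt = source.lower(), translation.lower()
    # A source term is "removed" if none of its renderings shows up.
    removed = [
        term for term, renderings in glossary.items()
        if term in src and not any(r in tgt for r in renderings)
    ]
    # A watch word is "added" if no cue for it appears in the source.
    added = [
        word for word, cues in watch_words.items()
        if word in tgt and not any(c in src for c in cues)
    ]
    return {"removed": removed, "added": added}


SOURCE = "Une nouvelle manifestation doit avoir lieu mardi."
GOOGLE = "A new event is scheduled to take place Tuesday."
SYSTRAN = "A new demonstration must take place Tuesday."

# Illustrative entries only; a real glossary would be far larger.
GLOSSARY = {"manifestation": ["demonstration"], "mardi": ["tuesday"]}
WATCH_WORDS = {"scheduled": ["prévu", "programmé"]}

google_report = check_elements(SOURCE, GOOGLE, GLOSSARY, WATCH_WORDS)
systran_report = check_elements(SOURCE, SYSTRAN, GLOSSARY, WATCH_WORDS)
print(google_report)
print(systran_report)
```

On the two renderings quoted above, the Google sentence is flagged both for the missing "manifestation" and for the unlicensed "scheduled", while the clunkier Systran sentence passes clean. A check like this only catches what the glossary already knows about, so it is a way of illustrating the problem rather than a fix for it.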
For now, patent practitioners are better off sticking to the built-in MT engines available on the EPO and JPO websites and the rule-based Systran system.
For other people, Google Translate is certainly going to be a useful offering. The readability and the (somewhat unreliable) capacity to convey gist will make things written in foreign languages more accessible. Google Translate also comes with some pretty nifty bells and whistles. My favorite is the ability to see what the source text actually said, just by moving your mouse over the text. It also lets you search (literally) the World Wide Web in foreign languages: you type in a word in English, Google translates it, runs the search with the translated term, and returns the hits in translated form.
So while the new kid on the block is unlikely to rock the world of patent practitioners, it at least makes a cool toy to play with.