blog.lukhnos.org

Formosana, a Collection of C++ Libraries for Processing Taiwanese Languages

Formosana is a C++ library collection that provides basic building blocks for processing Taiwanese languages. Three languages are currently supported: Mandarin, Holo and Hakka. It also provides a language-agnostic, bigram-based word segmentation library. It has no external dependencies and can be built on most platforms I know of. It is available on GitHub under the MIT License.

My day job is commercial iPhone and Mac software development. In addition to that, I also develop open source software, mostly in the form of libraries. Designing libraries and frameworks is both a good exercise in itself and an important part of software development. It pushes you to think and plan ahead for future consumption, and it also gives you a good opportunity to think about the fundamentals of a given problem set.

Formosana currently has three major components:

  1. Formosa::Mandarin: A library for processing Mandarin syllables and handling text input keyboard layouts. An abstract data type represents a Mandarin syllable. The syllable data type accepts both Pinyin and Bopomofo as input, and can be converted to either form as output. Its internal representation guarantees that the syllable is always in the CVCT form, although it does not guarantee that the produced syllable is always phonetically grammatical (i.e. it can be used to produce syllables not found in actual Mandarin). It also supports four major keyboard layouts (expandable) that map a standard US keyboard to Bopomofo symbols.

  2. Formosa::TaiwaneseRomanization: A library for processing Romanized Holo and Hakka. An abstract data type represents a Holo or Hakka syllable. Internally it uses POJ (pe̍h-oē-jī, also called Church Romanization by some). It accepts POJ for both input and output. Tâi-lô (TL, or Taiwanese Romanization System), technically a POJ variation that is the standard for Romanized Holo used by Taiwan's Ministry of Education, can also be used as both input and output. The syllable class has a normalization member function that guarantees only that the composed tone mark is placed on the correct vowel character according to the resonance in the Holo language. It is weaker than its Mandarin counterpart in that the syllable class does not guarantee the represented syllable is always in the Initial+Vowel+Final+Tone form. It accepts both the "composed" form (syllable with diacritics) and the "uncomposed" form (tone in numerals, also called database query form in the library) for input, and can produce output in either form. This library also supports keyboard layout mappings. Both numerical tone input and dead key combinations are supported.

  3. Formosa::Gramambular (literally, "gram walking"): A language-agnostic, bigram-based word segmentation library. It accepts an input set of weighted unigram and bigram key-value pairs, and outputs the best-scored path. If the key is the input syllables and the value is a Chinese phrase that the syllables represent, the walk is an input method. If we reverse the key and value, it becomes a word segmentation tool. As the library works without any grammatical knowledge, the quality of the dictionary (which provides the data source for the weighted nodes) is the deciding factor of the output quality. I have mentioned the principle of the library's design in a talk at Open Source Developer Conference, Taipei, Taiwan, in 2008. As a bonus, Gramambular has a debug helper that can produce output in the Graphviz DOT format, which you can then feed into the tool and get visualizations like this and this.
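
To make the "walk" in item 3 more concrete, here is a minimal sketch of a best-scored path search over a lattice of dictionary entries. It is not Gramambular's actual API: the names `UnigramTable`, `BigramTable`, `Node`, `bestWalk` and the `maxSpan` parameter are all made up for this illustration, scores are assumed to be log probabilities, and falling back to the unigram score when no bigram entry exists is my assumption about one reasonable scoring rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tables for illustration only (not the library's real types).
using UnigramTable = std::map<std::string, double>;  // word -> log score
using BigramTable =
    std::map<std::pair<std::string, std::string>, double>;  // (prev, cur) -> log score

// One lattice node: a dictionary word covering units [begin, end).
struct Node {
    std::size_t begin;
    std::size_t end;
    std::string word;
    double best;  // best accumulated score of any path ending at this node
    int prev;     // index of the best predecessor node, or -1
};

// Returns the best-scored sequence of words covering all units,
// or an empty vector if the dictionary cannot cover the input.
std::vector<std::string> bestWalk(const std::vector<std::string>& units,
                                  const UnigramTable& unigrams,
                                  const BigramTable& bigrams,
                                  std::size_t maxSpan = 4) {
    assert(maxSpan > 0);
    const double kNegInf = -std::numeric_limits<double>::infinity();

    // Build the lattice: every span found in the unigram table becomes a node.
    // Nodes end up ordered by begin position, so predecessors come first.
    std::vector<Node> nodes;
    for (std::size_t b = 0; b < units.size(); ++b) {
        std::string word;
        for (std::size_t e = b; e < units.size() && e - b < maxSpan; ++e) {
            word += units[e];
            if (unigrams.count(word)) {
                nodes.push_back(Node{b, e + 1, word, kNegInf, -1});
            }
        }
    }

    // Dynamic programming over the lattice.
    int bestFinal = -1;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        Node& cur = nodes[i];
        const double uni = unigrams.at(cur.word);
        if (cur.begin == 0) {
            cur.best = uni;  // path starts here
        } else {
            for (std::size_t j = 0; j < i; ++j) {
                const Node& p = nodes[j];
                if (p.end != cur.begin || p.best == kNegInf) continue;
                // Prefer the bigram score; fall back to the unigram score.
                auto it = bigrams.find(std::make_pair(p.word, cur.word));
                const double step = (it != bigrams.end()) ? it->second : uni;
                if (p.best + step > cur.best) {
                    cur.best = p.best + step;
                    cur.prev = static_cast<int>(j);
                }
            }
        }
        if (cur.end == units.size() && cur.best > kNegInf &&
            (bestFinal < 0 || cur.best > nodes[bestFinal].best)) {
            bestFinal = static_cast<int>(i);
        }
    }

    // Follow the back-pointers from the best final node.
    std::vector<std::string> path;
    for (int i = bestFinal; i >= 0; i = nodes[i].prev) {
        path.push_back(nodes[i].word);
    }
    std::reverse(path.begin(), path.end());
    return path;
}
```

With units `a`, `b`, `c` and unigram scores that slightly favor `a` + `bc`, adding a single strong bigram entry for (`ab`, `c`) is enough to flip the result to `ab` + `c`. That is the sense in which the dictionary data, not any grammatical knowledge, decides the output.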

Each of the components comes with its own demo code. I have also supplied Makefiles (for Mac OS X and other UNIX platforms) and Microsoft Visual C++ solution files for those sample projects.

The library collection makes use of a few helper classes from The OpenVanilla Project. I have included those class files (also written by me) in the source to make the collection buildable with no external dependencies.

Formosana was first designed for developing input methods, and both the Mandarin and the Taiwanese Romanization modules have been used in actual products. Although Gramambular has not yet been used in production, I have previously worked on an implementation based on similar principles for an internal project at my company. The commit history of the project will tell you that Gramambular was written pretty fast (2 days) from the ground up. For me it was also an exercise in starting over from scratch to see if the design is solid.

The library collection has many other uses in processing Taiwanese languages. There is also room for improvement. For example, a syllable class that can validate against the phonetic grammar is highly desirable. Currently the Taiwanese Romanization class instances are mutable: normalization changes the internal representation instead of returning a new immutable object. In addition, for the libraries to be useful for building language-related web applications, bindings to major scripting languages are also desirable. These are the things that developers interested in the field can work on.

I'll be highly interested in hearing from you if you use or plan to use Formosana in your own projects. My contact info regarding this project can be found on my GitHub profile.