"Misho", My Japanese Dictionary App


After two years of on-and-off development, I finally released my Japanese dictionary app Misho to the Android Play Store.

In this post, I’ll cover some of the technical challenges I encountered during its development.

Why Another Japanese Dictionary?

When I moved to Tokyo to study Japanese three years ago, I knew that practicing a native pitch accent was going to be an important part of my education. Unfortunately, none of the dictionaries on the app stores today have pitch accent data. I ended up writing code to pull it from various websites automatically.

The average, non-programmer Japanese student isn’t going to have access to this information. Misho democratizes access to Japanese pitch accents by packaging that data into an easy-to-use dictionary app.

Development Challenges

There were three major challenges during development.

  1. I had to merge several pitch accent data sources into a primary bilingual Japanese dictionary.
  2. I had to build fast, correct search on top of SQLite for both English- and Japanese-language queries.
  3. I had to juggle dozens of asynchronous signals in a sane way to optimize my typical usage pattern.

Challenge #1: Merging Data Sources

I discovered a couple of sources of pitch accent data, but none of them happened to be bilingual dictionaries. I couldn’t just render a single data source in an app - I had to process everything into a single, merged format. At this point, I decided to unify everything around what I think is the most easily accessible Japanese dictionary dataset, Jim Breen’s JMDict.

Pathological Cases

The primary problem with aligning data from two dictionaries is the potential to produce incorrect matches. Take the following two entries:

Kanji | Reading | Accent | Meaning
市場 | しじょう | 0 | market (i.e. stock market)
市場 | いちば | 1 | market (place)

The two words above have the same kanji spelling, but different readings - and depending on the reading, the accent is different. Clearly, aligning words solely based on their kanji spelling is insufficient!

Consider another pair of words that pose a challenge:

Kanji | Reading | Accent | Meaning
橋 | はし | 2 | bridge
箸 | はし | 0 | chopsticks

In this case, these two words have the same reading, but two different kanji spellings. Depending on the kanji (and thus the meaning), the accent is also different.

I quickly realized that aligning words was more nuanced than checking whether a single field matched. To guarantee correct word alignment across data sources despite these pathological edge cases, I developed two heuristics.

Heuristic #1: Full Match

If the kanji and reading of a word are the same across two sources, they align. I considered also comparing part of speech and definition keywords in this case, but was unable to find a single example where they were actually required for correctness.

This was the gold standard. Unfortunately, a large number of Japanese words do not have kanji associated with them to begin with (for example, all katakana words), so I did not consider this heuristic alone sufficient.
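
In code, the full match amounts to keying both sources on the (kanji, reading) pair. Here is a minimal Python sketch of the idea; the field names and list-of-dicts layout are illustrative, not Misho’s actual pipeline:

# Minimal sketch of Heuristic #1 (full match). The field names and
# list-of-dicts layout are illustrative, not Misho's actual pipeline.
def full_match(jmdict_entries, accent_entries):
    # Key the accent source on the (kanji, reading) pair.
    accents = {
        (e["kanji"], e["reading"]): e["accent"]
        for e in accent_entries
        if e["kanji"]  # kanji-less entries can never fully match
    }
    # Copy an accent onto every entry whose pair appears verbatim.
    for entry in jmdict_entries:
        key = (entry.get("kanji"), entry["reading"])
        if key in accents:
            entry["accent"] = accents[key]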

Heuristic #2: Common Accent Match

If all words with a particular reading had the same accent, then I assigned that accent to all words with that reading in the second source as well.

The failure mode would be a rare word sharing that common reading whose pitch accent actually differs from all the others, but simply wasn’t recorded in any source - I don’t think that’s likely in practice.

This case successfully handled aligning entries for virtually all katakana words, onomatopoeia, and a hodgepodge of other words.
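
Sketched the same way (again with illustrative names), this heuristic groups the accent source by reading and only fires when the accents are unanimous:

from collections import defaultdict

# Minimal sketch of Heuristic #2 (common accent match). Group the
# accent source by reading, then assign an accent only when every
# word with that reading agrees on it.
def common_accent_match(jmdict_entries, accent_entries):
    accents_by_reading = defaultdict(set)
    for e in accent_entries:
        accents_by_reading[e["reading"]].add(e["accent"])
    unanimous = {
        reading: accents.pop()
        for reading, accents in accents_by_reading.items()
        if len(accents) == 1
    }
    for entry in jmdict_entries:
        if "accent" not in entry and entry["reading"] in unanimous:
            entry["accent"] = unanimous[entry["reading"]]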

Heuristic Results

Across both heuristics, I was able to assign accents from various data sources to roughly half the words in the main bilingual dictionary, including a vast majority of common words.

Here are the final statistics on each heuristic’s performance:

annotating accents: 100%|██████████| 184955/184955
accent statistics:

full match      [77743]
common match    [27102]
none            [94664]

I developed and analyzed the heuristics in Python using a Jupyter Notebook. Being able to tweak a heuristic and immediately re-run the step at exactly that point in the data was vital in letting me iterate on my ideas efficiently.

Challenge #2: Fast Bilingual Search

Misho supports English- and Japanese-language search, and both are fast. During development, I discovered that bilingual dictionaries actually require two different types of search.

For home language (here, English) queries, the user is generally looking for natural language results. This is because they don’t necessarily know the exact word they’re looking for - it’s a foreign language after all.

On the other hand, target language (here, Japanese) queries require an exact match. The user isn’t looking for similar words - they want that exact word’s information. Naturally, this requires a different approach to search.

I decided to build two search systems, handling each usage pattern separately.

Additionally, two design decisions I made early on helped frame my performance requirements.

First, I wanted Misho to be completely offline, both for security and long-term maintenance reasons. The entire dictionary has to be compressed and deployed with the app. As a result, search index size was a major concern.

Second, I wanted to support “search as you type”, so the user can stop typing as soon as they see what they need. This put some fairly strict latency requirements on a round-trip search request.

After some preliminary testing, I set my performance requirements as follows:

  1. Minimal index size, since it will all get packaged in the app.
  2. Average and maximum query latency below 100ms for both English- and Japanese-language search.

For English-language queries, I added natural language search using the SQLite FTS5 extension, with document weights computed from “common word list” rankings. Queries generally complete in under 50ms on an average Android device, and the indices clocked in at ~20% of the data table size, making it a completely reasonable implementation.

Here’s what the actual, in-app query looks like:

SELECT entries.*
FROM entries
JOIN (
    SELECT rowid AS id
    FROM entries_fts
    WHERE entries_fts MATCH ?
    ORDER BY rank ASC
) USING (id)
LIMIT 100

This gave me a false sense of confidence - developing fast search for Japanese ended up taking me the next three months!

I couldn’t use FTS5 here because I needed to support exact partial matches. Specifically, Japanese students often search for a single kanji to explore its usage across words - and a lone character is frequently not a natural-language query.

My first implementation just used SQLite’s LIKE operator. I reasoned that the dictionary only has ~200k entries, so it might just be fast enough. Unfortunately, the average query took upwards of 400ms, far too slow for the “search as you type” feature.
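
For reference, that first pass was a single LIKE scan, roughly like the following (table and column names are illustrative):

import sqlite3

# The naive approach: a bare LIKE scan over the whole table. With
# ~200k rows, every keystroke pays for a full-table scan.
def naive_search(db: sqlite3.Connection, query: str):
    return db.execute(
        "SELECT * FROM entries WHERE term LIKE ? LIMIT 100",
        (f"%{query}%",),
    ).fetchall()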

At this point I started looking at how to accelerate LIKE using an index. Ultimately, I decided to go with an n-gram index, which chunks text into n-letter groups and indexes those. Unfortunately, SQLite doesn’t have a ready-made n-gram extension, so I had to implement my own.
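
To make “chunking” concrete, here is what n-gram generation looks like - a hypothetical Python helper, analogous to the makeTrigrams Dart function that appears later:

def make_ngrams(text, n):
    # Produce the overlapping n-letter chunks of `text`.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# e.g. make_ngrams("しじょう", 2) == ["しじ", "じょ", "ょう"]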

Round 1: Normalized N-gram Index

The normalized way to create an n-gram index is to create a table as follows:

Column | Description
gram | The n-gram itself. This will be indexed and queried using =.
item_id | The foreign-key ID of the item that possesses this gram.

When searching, you take the input query, produce its n-grams, and query all rows with each gram. The final search result is the intersection of all returned item IDs, meaning that those item IDs matched every gram.
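
Here is a sketch of that procedure, assuming the make_ngrams helper above and a table named ngrams with the two columns just described:

import sqlite3

# Fetch the item IDs for each gram of the query, then intersect the
# sets so that only items containing every gram survive.
def ngram_search(db: sqlite3.Connection, query: str, n: int = 2):
    matching = None
    for gram in set(make_ngrams(query, n)):
        rows = db.execute(
            "SELECT item_id FROM ngrams WHERE gram = ?", (gram,))
        ids = {item_id for (item_id,) in rows}
        matching = ids if matching is None else matching & ids
    return matching or set()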

Unfortunately, the table described above is prohibitively large in SQLite. Producing 1-grams, 2-grams, and 3-grams for the dictionary in Misho yielded an index about 1000% larger than the base data set - a far cry from the 20% of the FTS5 English index.

Moreover, the = operator itself was still too slow because there were over a million rows.

Not good.

Round 2: Denormalized N-gram Index

At this point, I still thought the n-gram approach was sound - I just had too many rows. So I decided to try de-normalizing my schema with the following table:

Column | Description
gram | The n-gram itself. This will be indexed and searched using =.
item_ids | The foreign-key IDs of all items possessing this gram.

When searching, an additional step is required to unpack the item ids prior to computing their intersection.

In return for this additional processing step, the number of rows in the n-gram index table was reduced from many times the number of words to a small fraction of it. Search was suddenly fast! Especially quick were those pesky single-kanji queries, since they required looking up just one row.

But space was still a problem - I was using numerical item IDs whose text representations ran 5-6 characters each. At first I tried Gzip and LZMA, but decoding was too slow. I settled on a simple delta encoding over sorted item ID arrays. This reduced the index size to a mere 50% of the original data size while keeping post-query “decoding” nearly instantaneous.
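
The encoding side, done once at dictionary build time, is tiny. Here is a sketch; the “+”-separated format matches what the Dart decoder below expects:

# Sort the IDs, then store each one as its offset from its
# predecessor, joined with "+". Sorting keeps the deltas (and thus
# the encoded strings) small.
def delta_encode(item_ids):
    deltas, previous = [], 0
    for item_id in sorted(item_ids):
        deltas.append(item_id - previous)
        previous = item_id
    return "+".join(str(d) for d in deltas)

# delta_encode({432, 466, 470, 471}) == "432+34+4+1"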

Search Implementation

Although search speed and index size were solved, the actual procedure for querying the tables remained. I initially wrote a recursive SQL procedure to perform everything in-database, but soon discovered that splitting it into two queries and doing the decoding/intersection in normal code was both more legible and more performant.

Following is the actual Dart implementation from the app. Given a string query, I generate its trigrams and retrieve the relevant IDs from the database. A lack of matches implies no results.

final trigrams = makeTrigrams(query);
final trigramRows = await db.rawQuery('''
    SELECT entry_ids
    FROM entries_trigram
    WHERE trigram IN (${List.filled(trigrams.length, '?').join(',')})
''', [
    ...trigrams,
]);

if (trigramRows.isEmpty) {
    return Optional.absent();
}

Next, we undo the delta encoding and intersect all decoded entry IDs in Dart. Again, if the intersection is empty, then no entry could possibly match the current query.

// Row entry IDs look like this: 432+34+4+1+34+943
final entryIds = trigramRows
    .map((row) => row['entry_ids'] as String)
    .map((ids) => ids
            .split('+') //
            .map((id) => int.parse(id)) //
            .fold(
                [0],
                (List<int> decodedIds, int delta) => decodedIds //
                    ..add(decodedIds.last + delta))
            .skip(1) //
            .toSet() //
        )
    .reduce((sharedIds, ids) => sharedIds.intersection(ids));

if (entryIds.isEmpty) {
    return Optional.absent();
}

Lastly, we take the final entry IDs and actually query them from the main search table, which contains rank information.

final entryRows = await db.rawQuery('''
    WITH limited_entries_search AS (
        SELECT *
        FROM entries_search
        WHERE entry_id IN (${List.filled(entryIds.length, '?').join(',')})
    )
    SELECT entries.*
    FROM (
        SELECT DISTINCT entry_id
        FROM limited_entries_search
        WHERE term LIKE ?
        ORDER BY rank ASC, LENGTH(term) ASC
    ) results
    JOIN entries
    ON entries.id = results.entry_id
    LIMIT 100;
''', [
    ...entryIds,
    '%$query%',
]);

final entries = entryRows
    .map((entry) => Map<String, dynamic>.from(entry))
    .map((entry) => Entry.fromJson(entry))
    .toList();

return Optional.of(entries);

The entire process takes 70-80ms on average on my Pixel 3, which falls within my requirements.

Challenge #3: Asynchronous Signal Management

I chose Flutter to develop Misho, but about halfway through the implementation I experienced extreme difficulties managing asynchronous signals. A dictionary makes for a relatively simple app, but there is one place where I took extreme care: the keyboard.

In my initial tests, having to manage opening and closing the keyboard was the most frustrating, time-consuming aspect of using the app. I resolved to have the keyboard open and close correctly, on its own. In order to do so, I had to isolate the signals necessary for this computation, but many of these signals come from disparate Android APIs. The code was a total mess!

Around the time I was wrestling with this problem, I came across “hooks”, which I wrote a (very) brief blog post about. I eventually rewrote the entire application using flutter_hooks. Aside from trivializing state management throughout the application, it also brought sanity and correctness to my keyboard management code.

The following is the actual keyboard management code from Misho. The keyboard pops up when the FocusNode attached to the search bar is programmatically focused using requestFocus.

final recentlyResumed = useRecentlyResumed();
final search = useSearch();
final searchStateStream = useStream(search.state, initialData: Initial());
final searchState = searchStateStream.data;
final previousSearchState = usePrevious(searchState);
final searchFocus = useMemoized(() => FocusNode());
final onCurrentRoute = ModalRoute.of(context).isCurrent;
final previousOnCurrentRoute = usePrevious(onCurrentRoute) ?? false;

useEffect(() {

    // Bring up the keyboard when...
    //  1. the search bar is showing
    //  2. search is ready
    //  3. the current scaffold's drawer is not showing
    if (onCurrentRoute && 
        searchState.ready && 
        !scaffoldKey.currentState.isDrawerOpen) {

        // ...and any of the following occurs:
        //  1. the app resumes
        //  2. the user came back to the search bar's screen
        //  3. the search database initialized successfully
        if (recentlyResumed || 
            !previousOnCurrentRoute || 
            !previousSearchState.ready) {

            // Keyboard won't come up unless you unfocus first.
            // https://github.com/flutter/flutter/issues/33305
            if (searchFocus.hasFocus) {
                searchFocus.unfocus();
            }

            searchFocus.requestFocus();
        }
    }
}, [recentlyResumed, onCurrentRoute, searchState.ready]);

I found this hook-based code far more palatable, and it worked correctly (almost) immediately. In hindsight, however, I believe I would struggle to test it in an automated way. Since that was neither a goal nor a requirement for this project, the implementation was entirely satisfactory.

Two Years Later

This section was written on December 17th, 2022.

I wrote this article two years ago, but chose not to publish it. I felt it was too complex, since the content leans toward relatively opaque code snippets. But upon reviewing it today, I found that some of my problems and solutions were interesting, even if the article isn’t organized in the most intuitive way.

So, here it is! Better late than never, right?

Developing a Japanese dictionary was a challenging experience, but ultimately I got what I wanted as a learning tool. I still use Misho as my daily driver!

Additionally, in the last few years many other dictionaries on the Android Play Store have started displaying pitch accents nearly identical to the ones in Misho! I think this is a great development. Misho is not quite popular enough for me to claim that it influenced those apps' development, but I’m happy that accent information has gradually become mainstream, regardless of the impetus.

Happy learning, and happy hacking!