Translate epub with LLM.
I recently watched Cosmic Princess Kaguya, and there is an official book with more details that didn't make it into the movie.
I love this movie, and naturally I want to read the book. However, I can't read a single Japanese character, and I don't want to learn Japanese. So I want to translate the book with an LLM.
To begin with, I purchased the book from Kobo JP. At purchase time, the store said there would be an Adobe DRM-protected epub for download. After the purchase, however, I found that Kobo JP does not offer any download option at all.
So I cracked the DRM and eventually got the epub. But I still cannot understand Japanese, so I created this project. Initially I was just thinking about using OpenCode to do it, but then I decided it was important to make everything reproducible. So I wrote code.
The code unpacks the epub, overrides the layout, handles some nasty Japanese layout stuff, and extracts the xhtml files into JSON, which is easier to handle. With the JSON, I performed entity extraction so that the key characters get consistent names. I also created a script to translate the paragraphs with an LLM, plus a QA LLM. Unfortunately, the QA is only semi-automated: the LLM checks and gives advice, but then I have to go through the findings one by one and decide whether and how to make each change. Japanese does not spell things out clearly, so most of the time I have to guess what a sentence means.
But anyway, within 48 hours (the weekend), I managed to convert this 300+ page epub from Japanese to zh-Hant (I speak zh-Hans, but I buy epubs from Taiwanese bookstores, so I'm used to reading zh-Hant).
Now I can read the book on my Kindle, with joy.
BTW, this project is NOT meant to replace a professional translation service. If Kadokawa publishes an official translation, I will happily purchase it. The LLM doesn't have logic, so it can't really think like a human, which means it can't understand the plot and figure out who is who.
This project is merely a PoC and a hacky solution for a desperate programmer who doesn't understand Japanese but eagerly wants to read the book related to a movie he loves very much.
- UnpackAndClean: Unpack the epub and clean it up. This fixes several things so the book isn't stuck with the Japanese layout.
- ExtractEntity: Scan the whole text; find people, things, and references to previous lines.
- Simple Translate: Translate the text with string replacement, mostly for copyright notices and similar boilerplate.
- Extract structure: Write the content into a JSON file so it's easier to process.
- Extract entity: List all entities and fix their translations.
- Translate: Translate the text with an LLM, and replace the original text with the translated text.
- QA: Check the translation using an LLM and flag sentences with flaws or errors. I fix them manually.
- Pack epub.
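To illustrate how the entity list can keep names consistent during the Translate step, here is a minimal sketch. It is not the project's actual code: `glossary_for`, `build_prompt`, the prompt wording, and the sample glossary entries are all assumptions made up for illustration. The idea is to pass the LLM only the glossary entries that actually occur in the paragraph being translated.

```python
from typing import Dict


def glossary_for(paragraph: str, glossary: Dict[str, str]) -> Dict[str, str]:
    """Return only the glossary entries whose source term occurs in the paragraph."""
    return {src: dst for src, dst in glossary.items() if src in paragraph}


def build_prompt(paragraph: str, glossary: Dict[str, str],
                 target_lang: str = "zh-Hant") -> str:
    """Build a translation prompt that pins down entity names (hypothetical wording)."""
    terms = glossary_for(paragraph, glossary)
    lines = [f"Translate the following Japanese paragraph into {target_lang}."]
    if terms:
        lines.append("Use these fixed translations for names and terms:")
        lines += [f"- {src} => {dst}" for src, dst in terms.items()]
    lines.append("Paragraph:")
    lines.append(paragraph)
    return "\n".join(lines)


# Illustrative glossary, not taken from the book.
GLOSSARY = {"かぐや": "輝夜", "月": "月亮"}
prompt = build_prompt("かぐやは空を見上げた。", GLOSSARY)
print(prompt)
```

Sending only the matching entries keeps the prompt short, and the glossary itself stays in one file that can be corrected by hand.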
Implemented EPUB unpack and pack.
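A minimal sketch of unpack and pack using only the standard library, not necessarily how this project does it. The function names are assumptions. The one detail that matters is from the EPUB container format: the `mimetype` file has to be the first entry in the zip and stored without compression.

```python
import tempfile
import zipfile
from pathlib import Path


def unpack_epub(epub_path, out_dir):
    """Extract every file of the epub (a zip archive) into out_dir."""
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(out_dir)


def pack_epub(src_dir, epub_path):
    """Zip src_dir into an epub. Write epub_path outside src_dir."""
    src = Path(src_dir)
    with zipfile.ZipFile(epub_path, "w") as zf:
        # mimetype goes first and uncompressed.
        zf.write(src / "mimetype", "mimetype", compress_type=zipfile.ZIP_STORED)
        for path in sorted(src.rglob("*")):
            arcname = path.relative_to(src).as_posix()
            if path.is_file() and arcname != "mimetype":
                zf.write(path, arcname, compress_type=zipfile.ZIP_DEFLATED)


def round_trip_demo():
    """Pack a tiny fake book and return (filename, compress_type) per entry."""
    with tempfile.TemporaryDirectory() as tmp:
        book = Path(tmp) / "book"
        (book / "META-INF").mkdir(parents=True)
        (book / "mimetype").write_text("application/epub+zip", encoding="ascii")
        (book / "META-INF" / "container.xml").write_text("<container/>", encoding="utf-8")
        out = Path(tmp) / "out.epub"
        pack_epub(book, out)
        with zipfile.ZipFile(out) as zf:
            return [(info.filename, info.compress_type) for info in zf.infolist()]


entries = round_trip_demo()
```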
Unpacked Cho kaguyahime 001 epub, found js for kobo. Might be residue from Calibre obok plugin.
I asked Gemini 3.1 Pro about it: the script is useless, especially if I'm going to read the book on a Kindle or another e-ink device.
About the class koboSpan: there is an inline style, <style xmlns="http://www.w3.org/1999/xhtml" type="text/css" id="koboSpanStyle">.koboSpan { -webkit-text-combine: inherit; }</style>, which does not affect anything
after the translation (target lang: zh_CN). So we can also remove it.
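A rough sketch of that cleanup with regular expressions. The `koboSpanStyle` id comes from the file itself. The assumption that the kobo script is pulled in by a `<script>` tag whose `src` contains "kobo" is mine, so check it against the actual xhtml before relying on it.

```python
import re

# The inline <style id="koboSpanStyle"> element, including its content.
KOBO_STYLE_RE = re.compile(
    r'<style\b[^>]*\bid="koboSpanStyle"[^>]*>.*?</style>', re.DOTALL
)
# Assumption: the kobo script is referenced as <script ... src="...kobo...">,
# either self-closed or with an explicit </script>.
KOBO_SCRIPT_RE = re.compile(
    r'<script\b[^>]*\bsrc="[^"]*kobo[^"]*"[^>]*?(?:/>|>.*?</script>)',
    re.DOTALL | re.IGNORECASE,
)


def strip_kobo_residue(xhtml: str) -> str:
    """Remove the kobo script reference and the koboSpan style block."""
    xhtml = KOBO_SCRIPT_RE.sub("", xhtml)
    return KOBO_STYLE_RE.sub("", xhtml)


sample = (
    '<head><title>t</title>'
    '<script src="../js/kobo.js" type="text/javascript"/>'
    '<style xmlns="http://www.w3.org/1999/xhtml" type="text/css" '
    'id="koboSpanStyle">.koboSpan { -webkit-text-combine: inherit; }</style>'
    '<script src="other.js"></script></head>'
)
cleaned = strip_kobo_residue(sample)
```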
To change from vertical to horizontal layout, we first need to set the page progression direction
from right-to-left to left-to-right. We could also update the css, but style-standard.css
already has this section:
```css
/* 横組み用 (for horizontal layout) */
html,
.hltr {
  -webkit-writing-mode: horizontal-tb;
  -epub-writing-mode: horizontal-tb;
}
/* 縦組み用 (for vertical layout) */
.vrtl {
  -webkit-writing-mode: vertical-rl;
  -epub-writing-mode: vertical-rl;
}
```

So we can simply replace the class vrtl with hltr to change the layout to horizontal.
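Both changes can be sketched as plain string edits. This assumes the OPF spine carries `page-progression-direction="rtl"` (the standard EPUB 3 attribute) and that `vrtl` shows up as a token inside `class="..."` attributes. The function names are made up for this example.

```python
import re


def to_horizontal_opf(opf_text: str) -> str:
    """Flip the spine's page progression from right-to-left to left-to-right."""
    return opf_text.replace(
        'page-progression-direction="rtl"', 'page-progression-direction="ltr"'
    )


def to_horizontal_xhtml(xhtml_text: str) -> str:
    """Replace the class token vrtl with hltr, leaving other class tokens alone."""

    def swap(match):
        tokens = ["hltr" if t == "vrtl" else t for t in match.group(2).split()]
        return match.group(1) + " ".join(tokens) + match.group(3)

    return re.sub(r'(\bclass=")([^"]*)(")', swap, xhtml_text)


print(to_horizontal_xhtml('<html class="vrtl">'))
print(to_horizontal_opf('<spine page-progression-direction="rtl">'))
```

Working on whole class tokens avoids touching unrelated class names that merely contain the letters "vrtl".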
For gaiji, there are four of them in the book I'm working with, so I decided to use a map to handle them.
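A sketch of what such a map could look like. It assumes the gaiji are embedded as `<img class="gaiji" src="...">` elements, which is a common convention in Japanese epubs but not confirmed for this book. The file name and the replacement character below are placeholders, not the book's real entries.

```python
import re
from typing import Dict

IMG_RE = re.compile(r"<img\b[^>]*>")
SRC_RE = re.compile(r'\bsrc="([^"]*)"')
CLASS_RE = re.compile(r'\bclass="([^"]*)"')

# Placeholder mapping: image file name -> Unicode character.
GAIJI_MAP: Dict[str, str] = {"gaiji-001.png": "﨑"}


def replace_gaiji(xhtml: str, gaiji_map: Dict[str, str]) -> str:
    """Swap gaiji images for mapped characters; leave every other image alone."""

    def repl(match):
        tag = match.group(0)
        cls = CLASS_RE.search(tag)
        src = SRC_RE.search(tag)
        if not cls or "gaiji" not in cls.group(1).split() or not src:
            return tag
        name = src.group(1).rsplit("/", 1)[-1]
        # Unmapped gaiji stay as images so they can be spotted later.
        return gaiji_map.get(name, tag)

    return IMG_RE.sub(repl, xhtml)


sample = (
    '<p>山<img class="gaiji" src="../image/gaiji-001.png" alt=""/>さん'
    '<img src="../image/cover.jpg" alt=""/></p>'
)
converted = replace_gaiji(sample, GAIJI_MAP)
```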
Add code to set title and lang correctly.
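For reference, a small sketch of setting the title and language in the OPF with `xml.etree.ElementTree`. The sample OPF and the title strings are invented. A real epub may also carry `lang` / `xml:lang` attributes on the xhtml files, which this sketch does not touch.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# Keep the usual prefixes when serializing instead of ns0/ns1.
ET.register_namespace("", NS["opf"])
ET.register_namespace("dc", NS["dc"])


def set_title_and_lang(opf_text: str, title: str, lang: str) -> str:
    """Overwrite dc:title and dc:language in the OPF metadata."""
    root = ET.fromstring(opf_text)
    for tag, value in (("dc:title", title), ("dc:language", lang)):
        element = root.find(f"opf:metadata/{tag}", NS)
        if element is not None:
            element.text = value
    # Note: this returns the document without an XML declaration.
    return ET.tostring(root, encoding="unicode")


SAMPLE_OPF = (
    '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>原題</dc:title><dc:language>ja</dc:language>"
    "</metadata></package>"
)
updated = set_title_and_lang(SAMPLE_OPF, "譯名", "zh-Hant")
```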
Add code to replace ja and jp font with general font.
Add code for the simple translation. It prints all pure-text elements, with the xpath alongside the text content. This doesn't work well for ruby tags. However, the simple translation is meant for sentences like copyright notices, so we can fix those manually by overriding the text and removing the ruby tags.
Extract xhtml content into structured JSON. Each block can be traced back to its source when we apply the translation. Also generate chapter summaries to support per-sentence translation.
LLM translation is done. It can do 82 pages in 20 minutes, out of 336 pages in total.
I still need an LLM QA to find:
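The listing step could look roughly like this with the standard library. It is only a sketch under my own definition of "pure text element" (an element with text and no child elements). The paths use local tag names, so they are for locating text by eye rather than for feeding back into a namespace-aware XPath query.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def local(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_text_elements(root: ET.Element) -> Iterator[Tuple[str, str]]:
    """Yield (path, text) for every element that has text and no children."""

    def walk(element, path):
        if len(element) == 0 and element.text and element.text.strip():
            yield path, element.text.strip()
        counts = {}
        for child in element:
            name = local(child.tag)
            counts[name] = counts.get(name, 0) + 1
            yield from walk(child, f"{path}/{name}[{counts[name]}]")

    yield from walk(root, "/" + local(root.tag))


SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>All rights reserved.</p>"
    "<p><ruby>月<rt>つき</rt></ruby>へ</p>"
    "</body></html>"
)
found = list(iter_text_elements(ET.fromstring(SAMPLE)))
for path, text in found:
    print(path, text)
```

The second paragraph shows the ruby problem: only the reading inside `<rt>` is reported, while the base text and the trailing text are missed.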
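One possible shape for the structured JSON, as a sketch only. The field names (`id`, `file`, `index`, `source`, `translation`) and the choice of one block per `<p>` are assumptions, not the project's real schema. Ruby readings are dropped from the source text so the LLM sees the base text only.

```python
import json
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
SKIP = {XHTML + "rt", XHTML + "rp"}  # ruby readings and fallback parentheses


def plain_text(element: ET.Element) -> str:
    """Concatenate an element's text, skipping <rt>/<rp> but keeping their tails."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in SKIP:
            parts.append(plain_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def extract_blocks(xhtml_text: str, file_name: str) -> list:
    """Turn each non-empty <p> into a block that can be traced back by file and index."""
    root = ET.fromstring(xhtml_text)
    blocks = []
    for index, paragraph in enumerate(root.iter(XHTML + "p")):
        text = plain_text(paragraph).strip()
        if text:
            blocks.append({
                "id": f"{file_name}#p{index}",
                "file": file_name,
                "index": index,
                "source": text,
                "translation": None,
            })
    return blocks


SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p><ruby>月<rt>つき</rt></ruby>へ行く。</p><p> </p><p>二行目。</p>"
    "</body></html>"
)
blocks = extract_blocks(SAMPLE, "p-001.xhtml")
print(json.dumps(blocks, ensure_ascii=False, indent=2))
```

Because `index` counts every `<p>` in document order, including the empty ones, a translated block can be written back to the same paragraph later.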
- Invalid xhtml, for example, a space before <span> instead of inside it.
- Missing sentences: sentences in Japanese that are not translated.
- Wrong translation: getting the meaning wrong.
- Fabrications: inventing meaning instead of sticking to the Japanese content.
- Wrong term or entity name.
I'll let it flag the invalid sentences, and I'll fix them manually. This way, I should be able to fix things without reading the whole book.
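Two of these checks can be approximated without an LLM, which may cut down what the QA model has to look at. The sketch below is my own suggestion, not part of the project: it flags fragments that are not well-formed XML and fragments that still contain hiragana or katakana. It cannot catch wrong meanings, fabrications, or wrong entity names.

```python
import re
import xml.etree.ElementTree as ET

# Hiragana (U+3041-U+3096) and katakana (U+30A1-U+30FA) letters.
# Kanji are not checked because they overlap with Chinese text.
KANA_RE = re.compile(r"[\u3041-\u3096\u30A1-\u30FA]")


def check_block(translated_xhtml: str) -> list:
    """Return a list of issue strings for one translated fragment."""
    issues = []
    try:
        # Wrap in a dummy root so fragments with several siblings still parse.
        # Named HTML entities such as &nbsp; would be reported as invalid here.
        ET.fromstring(f"<root>{translated_xhtml}</root>")
    except ET.ParseError as exc:
        issues.append(f"invalid xhtml: {exc}")
    if KANA_RE.search(translated_xhtml):
        issues.append("untranslated kana")
    return issues


print(check_block("<span>你好</span>"))
print(check_block("<span>你好"))
print(check_block("這是つき"))
```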