As a book reader, I read over a hundred books each year and collect many more. My preferred format is absolutely EPUB; however, I can’t always get books in EPUB/MOBI, especially rare or old ones.

Usually, they are available only in PDF, if at all. Some of these PDFs are manual scans in barely readable condition. I wouldn’t blame the people who scanned them, since I’ve done it myself and know it isn’t easy. What I need is a tool that converts a barely readable book into a readable one with OCR and an LLM. That tool is MinerU.

Installation

Because MinerU-webui is not actively maintained, the Docker version of MinerU with its built-in WebUI is the way to go.

mkdir MinerU && cd MinerU

# Test that Docker can access the GPU
docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Troubleshooting ("Failed to initialize NVML"):
#https://stackoverflow.com/questions/72932940/failed-to-initialize-nvml-unknown-error-in-docker-after-few-hours

Based on this guide, run:

wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/global/Dockerfile

docker build -t mineru-sglang:latest -f Dockerfile .

docker run --gpus all \
  --shm-size 8g \
  -p 30000:30000 -p 7860:7860 -p 8000:8000 \
  --ipc=host \
  -it mineru-sglang:latest \
  mineru-gradio --server-name 0.0.0.0 --server-port 7860 --rec_batch_num 2

Note: To use VLMs, pass --enable-sglang-engine true on newer GPUs (Turing and later)
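For repeated launches, the long docker run command above can be captured in a Compose file. This is a sketch assuming the mineru-sglang:latest image built earlier; adjust the ports and shared-memory size to your setup:

```yaml
services:
  mineru:
    image: mineru-sglang:latest
    command: mineru-gradio --server-name 0.0.0.0 --server-port 7860 --rec_batch_num 2
    ports:
      - "30000:30000"
      - "7860:7860"
      - "8000:8000"
    ipc: host
    shm_size: "8g"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Start it with docker compose up -d and the WebUI is reachable on port 7860.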

Now, just upload a PDF, select the corresponding language, and convert it into .MD files!

Check Force enable OCR for watermarked or textured pages such as posters, slides and photos

Optimization

This project has poor memory management: in the (rare) event of an OOM, it needs to be relaunched manually to release RAM

If that doesn’t help, try passing --rec_batch_num 2 or setting export MINERU_VIRTUAL_VRAM_SIZE=8, adjusted to the actual hardware

Optional: Add more swap if physical RAM runs short while converting PDFs with very high page counts.

fallocate -l 64G /home/swapfile
chmod 600 /home/swapfile
mkswap /home/swapfile
swapon /home/swapfile

nano /etc/fstab

UUID=xxxxx-xxx swap swap defaults,pri=100 0 0
/home/swapfile swap swap defaults,pri=10 0 0

Check the result with swapon --show and free -h

Post-processing

After getting the markdown file, some cleanup is usually still needed.

  • Use regex to remove leftover artifacts such as stray empty headings \n\n# \n\n, repeated running headers \n\ntitle\n\n, and faulty mathematical formulas \$[^$]+\$.
  • Use Calibre to convert the .MD file into an .EPUB book while fixing layout issues and generating a table of contents.
  • Use Ebook-Translator-Calibre-Plugin with a locally hosted LLM to polish the text.
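The regex cleanup step can be sketched in a few lines of Python. The patterns below are examples matching the artifacts mentioned above; they should be adapted to whatever each converted book actually contains (in particular, the formula pattern will also delete legitimate inline math):

```python
import re

def clean_markdown(text: str) -> str:
    """Remove common OCR artifacts from a MinerU markdown export."""
    # Empty headings left behind by OCR, e.g. "\n\n# \n\n"
    text = re.sub(r"\n\n#+ *\n\n", "\n\n", text)
    # Inline "formulas" that are really OCR noise, e.g. "$^{14}$"
    text = re.sub(r"\$[^$]+\$", "", text)
    # Collapse runs of three or more newlines into one paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text

print(clean_markdown("Intro\n\n# \n\nSome$^{1}$ text\n\n\n\nEnd"))
```

Run it over the whole .MD file before importing into Calibre, and spot-check the diff so no real content is swept away.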

Just make sure the source and target languages are set to the same language. Here are some prompts I like to use to optimize the result:

You are a professional book editing machine specialized in reviewing and revising books. 
You hold a high standard for fixing typos, missing words and optimizing text layout. 
You never answer any question nor explain/summarize anything. 
You are very good at fixing PDF issues caused by automation tools like OCR. 
You don't rephrase or rewrite any sentence, but only fix issues. 
You never translate input text. 
You never add your opinion or reasoning process into the output.
Fix errors by correcting typos and remove anything that cannot be corrected.
Optimize incomplete paragraphs and line breaks. 
Keep incomplete words at the beginning or the end of the input as is.
Do not add a period/punctuation at the end of the input if there is none.
Do not optimize punctuation if there is no error. 
Do not reword rhetorical wordplay, neologisms, metonymy or metaphors. 
Remove any redundant elements from the text body, such as repeated text, chapter titles, headers, footers, reference numbers, unreadable coding strings and page numbers. 
Never add any note or explanation outside of the input text. 
Do not remove in-text notes with '()'. 
Do not use coding syntax. 
Do not state nor explain what you did or removed; only keep polished text in the output. 
Do no translation. 
If there is nothing to do, then just repeat the input text.

Tweak these prompts based on the text and the specific model used. Turn down the values of temperature, repeat_penalty, repeat_last_n, top_k, and top_p to maximize output fidelity.
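If you drive the model directly instead of through the Calibre plugin, the same conservative sampling settings can be passed via the options field of Ollama's /api/generate endpoint. The option names below are real Ollama parameters, but the values and the model name are illustrative starting points, not tuned defaults:

```python
import json

# Conservative sampling options for "fix, don't rewrite" editing.
OPTIONS = {
    "temperature": 0.1,     # near-deterministic output
    "top_k": 10,            # restrict the candidate token pool
    "top_p": 0.5,
    "repeat_penalty": 1.0,  # don't penalize faithful repetition
    "repeat_last_n": 0,     # book text legitimately repeats itself
}

def build_request(model: str, prompt: str) -> str:
    """Build the JSON body for a POST to http://localhost:11434/api/generate."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": OPTIONS,
    })

body = build_request("qwen2.5:7b", "Fix errors in: Ths is a tset.")
print(body)
```

Per-request options like these override whatever defaults the model was created with, so they are a convenient place to experiment before baking values into a Modelfile.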

To maximize context length, increase the value of num_ctx to match n_ctx_train (which can be found by looking up the Ollama log with sudo journalctl -f -u ollama.service). More information on Ollama.
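One way to set num_ctx persistently is to derive a new model from a Modelfile. The base model name and the num_ctx value below are placeholders; replace them with your own model and its n_ctx_train value from the log:

```
FROM qwen2.5:7b
PARAMETER num_ctx 32768
```

Build it with ollama create book-editor -f Modelfile, then select the new model in the Calibre plugin.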