fortc.com

How to use zim-tools

This guide explains how to compile zim-tools on an Ubuntu workstation and use zimdump to extract the files from a ZIM archive.

The Kiwix project is an open-source initiative that makes online content available for offline use. Kiwix allows users to download entire websites, such as Project Gutenberg, in a compressed file format known as ZIM. Kiwix also maintains OpenZIM, a set of open-source tools used to manipulate ZIM archives.

Install prerequisites

On a freshly installed Ubuntu 22.04 system, install the prerequisite packages.

$ sudo apt-get install liblzma-dev \
    libicu-dev \
    libzstd-dev \
    libxapian-dev \
    meson \
    libdocopt-dev \
    libkainjow-mustache-dev \
    libmagic-dev \
    zlib1g-dev \
    libgumbo-dev \
    cmake
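
Before building, it can help to confirm that the build tools are actually on the PATH. The following is a minimal sketch and not part of the original procedure: missing_tools is a hypothetical helper name, and the tool list (git, meson, ninja, cmake, g++, pkg-config) is an assumption about what the later steps rely on.

```shell
# missing_tools: print each named command that is not found on the PATH.
# (Hypothetical helper for illustration; not part of zim-tools.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool list for the build steps in this guide.
missing=$(missing_tools git meson ninja cmake g++ pkg-config)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All build tools found."
fi
```

Anything reported as missing can usually be installed with apt-get before continuing.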

Install libzim

Create a project directory and clone the libzim repo.

$ mkdir ~/OpenZIM
$ cd ~/OpenZIM
$ git clone https://github.com/openzim/libzim.git
$ cd libzim

Compile and install libzim.

$ meson setup build
$ ninja -C build
$ sudo ninja -C build install
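
The zim-tools build needs to locate the libzim that was just installed. As a rough check, the sketch below asks pkg-config for the installed version. It assumes the pkg-config module is named libzim; libzim_version is a hypothetical helper name.

```shell
# libzim_version: print the libzim version that pkg-config can see,
# or return non-zero with a message on stderr if it is not visible.
# (Hypothetical helper; assumes the pkg-config module is named "libzim".)
libzim_version() {
    if ! command -v pkg-config >/dev/null 2>&1; then
        echo "pkg-config is not installed" >&2
        return 1
    fi
    pkg-config --modversion libzim 2>/dev/null || {
        echo "libzim is not visible to pkg-config" >&2
        return 1
    }
}

if version=$(libzim_version); then
    echo "libzim $version is installed"
fi
```

If a later step fails at run time with a missing libzim shared library error, refreshing the linker cache with sudo ldconfig is a common remedy for libraries installed under /usr/local.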

Install zim-tools

Clone the zim-tools repo.

$ cd ~/OpenZIM
$ git clone https://github.com/openzim/zim-tools.git
$ cd zim-tools

Compile and install zim-tools.

$ meson setup build
$ ninja -C build
$ sudo ninja -C build install

Test zimdump.

$ zimdump --version
zim-tools 3.2.0

libzim 8.2.0
+ libzstd 1.4.8
+ liblzma 5.2.5
+ libxapian 1.4.18
+ libicu 70.1.0
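
Beyond --version, zimdump can also inspect an archive without extracting it, which makes a quicker smoke test once a ZIM file is at hand. The sketch below assumes zimdump's info and list subcommands; inspect_zim is a hypothetical helper name, and the file path passed to it is whatever archive you have available.

```shell
# inspect_zim: print archive metadata and the first few entries of a ZIM file.
# (Hypothetical helper; assumes zimdump's "info" and "list" subcommands.)
inspect_zim() {
    zim_file=$1
    if ! command -v zimdump >/dev/null 2>&1; then
        echo "zimdump not found on PATH" >&2
        return 1
    fi
    if [ ! -f "$zim_file" ]; then
        echo "no such file: $zim_file" >&2
        return 1
    fi
    zimdump info "$zim_file"                 # archive-level metadata
    zimdump list "$zim_file" | head -n 10    # first ten entry paths
}
```

Calling inspect_zim with the path to a downloaded archive prints a short summary, or an error message if zimdump or the file is missing.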

Dump a ZIM archive

As a test, download the Project Gutenberg ZIM file from Kiwix and verify that its SHA-256 checksum matches the value shown below.

This example is from May 2023. See the Kiwix website for the latest version.

$ curl -O --progress-bar https://download.kiwix.org/zim/gutenberg/gutenberg_en_all_2023-05.zim

$ sha256sum gutenberg_en_all_2023-05.zim
c57133c971c7cf82df907e8fe037e84d7ee2d54ec6bd72af97b6ba509e33d9cf  gutenberg_en_all_2023-05.zim
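
Comparing a 64-character digest by eye is error-prone, so the comparison can be scripted. This is a minimal sketch: verify_sha256 is a hypothetical helper name, and the expected digest is assumed to come from a source you trust.

```shell
# verify_sha256: compare a file's SHA-256 digest with an expected hex string.
# Returns 0 on a match, 1 on a mismatch. (Hypothetical helper name.)
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}
```

Called with the ZIM file name and the expected digest as its two arguments, the function prints OK on a match and returns a non-zero status otherwise, so it can gate the dump step in a script.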

Dump the ZIM to a dump directory.

$ mkdir dump
$ zimdump dump --dir=dump gutenberg_en_all_2023-05.zim

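Wait about 20 minutes for it to complete, then check the directory. In this example the listing is 1,423,653 lines long. Note that ls -l prints a leading "total" line and lists only the top level of the dump directory, so this figure approximates the number of extracted files.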

$ ls -l dump | wc -l
1423653
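
For a count that descends into subdirectories and excludes the directories themselves, find can be used instead of ls. This is a minimal sketch; count_files is a hypothetical helper name, and it assumes no file names contain newline characters.

```shell
# count_files: count regular files under a directory, recursively.
# (Hypothetical helper name; uses find so that subdirectories are included.)
count_files() {
    find "$1" -type f | wc -l | tr -d '[:space:]'
}
```

Running count_files dump prints a single number. It may differ from the ls -l figure above, since it counts only regular files at every depth.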