How-to create a Python scraper

From openZIM
Revision as of 07:41, 4 July 2024 by Benoit74B

Guidelines

A Python scraper should ideally:

- adhere to openZIM's Contribution Guidelines and implement openZIM's Python bootstrap, conventions and policies

- be hosted on GitHub under the openzim organization (we can create a repo for you there on request)

- use python-scraperlib (zimscraperlib on PyPI) to create the ZIM (it also provides many useful utilities)

- re-encode images and videos so that the final ZIM size is moderate (at least by default)

- cache these re-encoded assets in an S3 bucket (we can provide you with a dev bucket on request) so that the scraper does not waste time and computing resources re-encoding them at every ZIM update

- be configurable with CLI flags, especially for ZIM metadata (title, description, tags, ...) and filename

- validate all this metadata as early as possible, to avoid spending time fetching and transforming online resources only to realize at the end that the metadata is invalid and no ZIM can be produced

- avoid relying on the filesystem as much as possible, i.e. prefer adding items to the ZIM on the fly rather than arranging all files on the filesystem and adding them to the ZIM only in a final stage

- consume as few resources as possible (CPU time, disk I/O, disk space, RAM, ...)

- implement proper logging with various log levels (error, warning, info, debug)

- write task progress to a JSON file so that integration with Zimfarm is smoother
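Several of the guidelines above (CLI flags, early metadata validation, logging levels, progress file) can be sketched together in a small entry point. This is only a minimal illustration using the Python standard library, not the layout of any existing openZIM scraper. The scraper name, the flag names, the length limits and the `{"done": ..., "total": ...}` layout of the progress file are all assumptions made for the example; check them against the openZIM metadata specification and the Zimfarm documentation before relying on them.

```python
import argparse
import json
import logging
from pathlib import Path

logger = logging.getLogger("example-scraper")  # hypothetical scraper name

# Assumed limits; verify against the openZIM metadata specification.
MAX_TITLE_LENGTH = 30
MAX_DESCRIPTION_LENGTH = 80


def parse_args(argv=None):
    """Parse CLI flags. All flag names here are illustrative."""
    parser = argparse.ArgumentParser(prog="example-scraper")
    parser.add_argument("--title", required=True)
    parser.add_argument("--description", required=True)
    parser.add_argument("--tags", default="", help="semicolon-separated tags")
    parser.add_argument("--zim-file", default="example.zim")
    parser.add_argument("--stats-filename", type=Path, default=None)
    parser.add_argument("--debug", action="store_true")
    return parser.parse_args(argv)


def validate_metadata(args):
    """Fail fast, before any fetching or re-encoding takes place."""
    errors = []
    if not args.title.strip():
        errors.append("title is empty")
    elif len(args.title) > MAX_TITLE_LENGTH:
        errors.append(f"title is longer than {MAX_TITLE_LENGTH} characters")
    if not args.description.strip():
        errors.append("description is empty")
    elif len(args.description) > MAX_DESCRIPTION_LENGTH:
        errors.append(
            f"description is longer than {MAX_DESCRIPTION_LENGTH} characters"
        )
    if not args.zim_file.endswith(".zim"):
        errors.append("filename must end with .zim")
    if errors:
        raise ValueError("invalid metadata: " + "; ".join(errors))


def write_progress(stats_path, done, total):
    """Write task progress as JSON (assumed layout: done / total counters)."""
    if stats_path is None:
        return
    stats_path.write_text(json.dumps({"done": done, "total": total}))


def main(argv=None):
    args = parse_args(argv)
    logging.basicConfig(
        level=logging.DEBUG if args.debug else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    validate_metadata(args)  # raises before any expensive work starts
    logger.info("metadata is valid, starting scraper for %s", args.zim_file)

    steps = ["fetch", "re-encode", "add to ZIM"]  # placeholders for real work
    for done, step in enumerate(steps, start=1):
        logger.debug("running step: %s", step)
        write_progress(args.stats_filename, done, len(steps))
    return 0
```

Calling `main(["--title", "Example", "--description", "An example ZIM", "--stats-filename", "progress.json"])` runs the three placeholder steps and leaves the final counters in `progress.json`. A title that is too long makes `validate_metadata` raise before any step runs, which is the point of validating early.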

How to develop a nice UI to run inside the ZIM

The original scrapers use Jinja2 to render HTML files dynamically and add them to the ZIM. We are currently migrating to another approach, where the UI running inside the ZIM is a Vue.js project. We are not yet certain which approach is best. Vue.js makes it possible to quickly build very dynamic interfaces in a clean way, whereas the Jinja2 approach usually relied on "crappy" JS based on jQuery and similar libraries. However, Vue.js probably supports a more limited set of browsers and brings a steeper learning curve for contributing to scrapers. The freeCodeCamp scraper already uses the Vue.js approach, and the YouTube scraper is currently migrating to it. The Kolibri scraper has begun to migrate, but that work is still stuck in a v2 branch.