Difference between revisions of "Content team/ZIM Naming Convention"
Revision as of 08:39, 13 June 2024
This page was originally located at https://github.com/openzim/overview/wiki/ZIMs-Naming-Convention
This page explains the naming convention used both for the ZIM `Name` metadata and the ZIM filename, for ZIMs published by openZIM.
This is an openZIM convention, i.e. other publishers are free to follow the same convention or develop their own.
Context
- When publishing a ZIM, it's important to pay attention to its metadata as those are the way other people will distinguish it from other content
- The Metadata page lists the common and required metadata expected for a ZIM file
- None of them needs to be unique. ZIMs already include an identifier (called ID, which is a UUID) that is generated automatically during creation. That doesn't diminish the value of the other metadata though. You still want readers to easily and confidently choose ZIMs according to those.
- We need to ensure that collisions will not happen (typically two different websites leading to the same ZIM Name) and that users understand which source content they are downloading / using
- Choosing good and appropriate metadata can be difficult, but it's not what this document is about.
This document is about setting a valid `Name` metadata and filename for openZIM-created ZIMs (usually via the Zimfarm).
Why do we care?
- We create thousands of ZIMs every month. Convention is essential to be able to automate some tasks.
- Convention means applying a pattern, so no need to find what to use: simpler, faster.
- We use the `Name` metadata to match Zimfarm-produced ZIMs with *Titles* in the CMS.
- We use the `Name` metadata to set the ZIM filename in most scrapers.
- Many scripts depend on the filenames to maintain the central library: build the XML library, move files to the appropriate folder, evict older files, generate redirects, etc.
- Offspot YAML catalog uses *Human IDs* that are derived from the filenames.
ZIM `Name` Metadata
Format: `{domain}_{lang}_{selection}`
The `_` character is reserved as separator between the parts.
The parts must only contain alphanums or `-` or `.` characters.
The parts must be all lowercase.
Part | Description | Example |
---|---|---|
`domain` | Domain name (or project) (1) | `android.stackexchange.com`, `wikipedia` |
`lang` | ISO-639 language code or `mul` (2) | `en`, `fr`, `zh`, `mul` |
`selection` | A short, slug-like string indicating the selection over the project | `all`, `top`, `football` |
- 1 By default, use the web domain name associated with the content (including for YouTube channels, etc.). Project names are exceptions (basically valid only if we have at least a dedicated category for the project); use the domain name if unsure or, better, ask on Slack. Should the domain name contain characters that are illegal in our convention, it will be encoded with Punycode (see e.g. https://www.punycoder.com/).
- 2 Whenever possible, prefer the ISO-639-1 (2 chars) language code. When the ISO-639-1 code does not exist or is ambiguous (leading to a ZIM Name conflict between two different ZIMs), using the ISO-639-3 code is recommended. When multiple languages are present inside the ZIM, `mul` is to be used. Note that the ZIM `Language` metadata lists all the languages (ISO-639-3) instead of using `mul`.
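The rules above can be sketched as a small check. This is only an illustration, not code taken from any openZIM scraper: `build_name` and `to_ascii_domain` are hypothetical helpers, and "alphanums" is assumed here to mean ASCII letters and digits. The Punycode step relies on Python's built-in `idna` codec.

```python
import re

# One part: lowercase ASCII alphanumerics plus "-" and ".".
# "_" is excluded because it is the reserved separator.
# Assumption: "alphanums" is read as ASCII-only.
PART_RE = re.compile(r"[a-z0-9.-]+")


def to_ascii_domain(domain: str) -> str:
    """Punycode-encode a domain containing non-ASCII characters (hypothetical helper)."""
    return domain.encode("idna").decode("ascii").lower()


def build_name(domain: str, lang: str, selection: str) -> str:
    """Join the three parts into a Name, rejecting parts that break the convention."""
    parts = {"domain": domain, "lang": lang, "selection": selection}
    for label, value in parts.items():
        if not PART_RE.fullmatch(value):
            raise ValueError(f"invalid {label} part: {value!r}")
    return "_".join(parts.values())


if __name__ == "__main__":
    print(build_name("android.stackexchange.com", "en", "all"))
    print(build_name(to_ascii_domain("bücher.de"), "de", "all"))
```

A part such as `Wikipedia` (uppercase) or `top_100` (contains the separator) is rejected with a `ValueError`, which is the kind of collision-prone input the convention is meant to rule out.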
ZIM filename
Format: {Name}[_{flavour}]_{period}.zim
The `_` character is reserved as separator between the parts.
The parts must only contain alphanums or `-` or `.` characters.
The filename must be all lowercase.
Part | Description | Example |
---|---|---|
`Name` | The Name metadata described above (1) | `wikipedia_fr_top`, `wikihow_th_all`, `stackoverflow.com_en_all` |
`flavour` | Optional. One of the existing flavours, indicating a modification of the content for size reasons | `mini`, `nopic`, `maxi` |
`period` | The period when the ZIM was created, in format YYYY-MM (year-month) | `2019-03`, `2022-12` |
- 1 It doesn't need to be equal to the `Name` metadata, but the requirements are identical.
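As an illustration of the filename format, the sketch below splits a filename into its `Name`, optional `flavour` and `period` parts. `parse_filename` is a hypothetical helper, not part of any openZIM tool. It assumes ASCII-only parts and a name made of exactly three parts, and it does not restrict `flavour` to a fixed list, since this page only gives examples.

```python
import re

# One part: lowercase ASCII alphanumerics plus "-" and "." (assumption: ASCII only).
PART = r"[a-z0-9.-]+"

FILENAME_RE = re.compile(
    rf"(?P<name>{PART}_{PART}_{PART})"           # {domain}_{lang}_{selection}
    rf"(?:_(?P<flavour>{PART}))?"                # optional flavour
    r"_(?P<period>\d{4}-(?:0[1-9]|1[0-2]))\.zim"  # YYYY-MM period, then the extension
)


def parse_filename(filename: str) -> dict:
    """Return the name, flavour (or None) and period of a ZIM filename."""
    match = FILENAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"filename does not follow the convention: {filename!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_filename("wikipedia_fr_top_2019-03.zim"))
    print(parse_filename("wikipedia_en_all_nopic_2022-12.zim"))
```

Scripts that maintain a library (moving files, evicting older ones) could rely on this kind of parsing, which is why a filename that deviates from the pattern is treated as an error here rather than guessed at.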
Implementation on the Zimfarm
Depending on the scraper, setting the `Name` metadata in the Zimfarm can be mandatory (follow the instructions above) or optional. When optional, the scraper usually sets it properly according to the convention. Should it not, open a ticket on the scraper repo and set it manually in the recipe until it is fixed.
Filenames are also optional in the Zimfarm, but the common behavior is to append the period part (ex: `_2022-01`) after the value of the `Name` metadata. If you customized the `Name`, make sure the filename will remain valid or set it manually.
Important: when setting the filename manually, you are responsible for the whole filename, including the period part. Most scrapers allow inserting a special `{period}` string that will be replaced with the actual year-month period. Ex: `supersite.com_en_all_{period}.zim`.
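The behavior described above can be approximated as follows. This is a minimal sketch under assumptions, not the actual implementation of any scraper: `default_filename` and `resolve_filename` are hypothetical names, and real scrapers may compute the period differently.

```python
from datetime import date


def format_period(day: date) -> str:
    """Format a date as the YYYY-MM period used in filenames."""
    return day.strftime("%Y-%m")


def default_filename(name: str, day: date) -> str:
    """Common default: append the period part to the Name metadata."""
    return f"{name}_{format_period(day)}.zim"


def resolve_filename(template: str, day: date) -> str:
    """Replace the special {period} string in a manually set filename."""
    return template.replace("{period}", format_period(day))


if __name__ == "__main__":
    run_day = date(2022, 1, 15)  # illustrative date only
    print(default_filename("supersite.com_en_all", run_day))
    print(resolve_filename("supersite.com_en_all_{period}.zim", run_day))
```

A manually set filename without `{period}` comes back unchanged from `resolve_filename`, which illustrates why the period part is the author's responsibility in that case.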