Difference between revisions of "Content team"

(33 intermediate revisions by 3 users not shown)
Line 7: Line 7:
* Book curation must remain focused on educational material, broadly construed;
* Book curation must remain focused on educational material, broadly construed;
* Books should have proper visual formatting;
* Books should have proper visual formatting;
* Books should be up-to-date;
* Books should be up-to-date like custom apps;
* The Kiwix Library should allow easy and friendly discovery of content.
* The Kiwix Library should allow easy and friendly discovery of content.
Priority should always be put on maintaining the existing content over creating new ones.


== Responsabilities ==
== Responsabilities ==
Line 58: Line 60:
=== Scraping ===
=== Scraping ===
* Scraping leadership means the initiative should come from the content team
* Scraping leadership means the initiative should come from the content team
* Zimfarm activity should be monitored periodically
* First analysis of error should be done by content team
* First analysis of error should be done by content team
* If error in scraper is suspected
* If error in scraper is suspected
Line 75: Line 78:


=== Scraping ===
=== Scraping ===
==== Zimfarm monitoring ====
Zimfarm activity should be monitored periodically (at the daily or weekly frequency).
Should be systematically treated:
* Failed recipes
* Recipes taking longer than usual
* Worker disappearance


=== Library Management ===
=== Library Management ===
Line 80: Line 91:
=== Custom Apps ===
=== Custom Apps ===


== Worflows ==
== Workflows ==
 
=== Scraping ===
 
==== Create a Youtube recipe ====
 
To create a new recipe to scrape videos from a Youtube Channel/Username or one-or-more Playlists.
 
It’s recommended to clone an existing Youtube recipe.
 
* In "Content settings":
# Create the recipe name as per [https://github.com/openzim/overview/wiki/Naming-Convention the naming conventions].
# In the Language space, choose the language(s) of the Youtube page you are creating the recipe for.
# From Category space, choose (other)
# From warehouse path space, choose "/.hidden/.dev" always as a first time in order to test the resulted ZIM file.
# if the file is tested and all is correct then you update the recipe with the proper path "videos". Otherwise tune the recipe and relaunch a task.
# Make sure the Status is set to Enabled.
# You can choose Periodicity to be monthly or quarterly. Use monthly per default.
 
* In "Task settings":
# In Offliner space choose: Youtube
# In platform space choose Youtube.
# Keep the rest the same with no change.
 
*In "Scraper settings: youtube command flags":
# In Playlist mode: choose (Not Set) if you are doing the recipe for a whole channel.
# If you are doing the recipe for a playlist, choose (Set).
# In Type: choose (Channel) or (Playlist) as per your required file.
# In Youtube ID: type the ID of the channel or the playlist.
# For the API Key: There is a list of keys mostly as per the channel or the playlists sizes, ask for the list to choose the appropriate API Key.
# In ZIM Name: the recipe name as per the naming conventions [here](https://github.com/openzim/overview/wiki/Naming-Convention).
# In Title: type the name you want for the output file.
# Description: type a short description of your required zim file.
# Leave Optimisation Cache URL as it is (cloned from old recipe).
# Leave the rest of the fields empty or as per the cloned recipe.
# Finally, click in the bottom on (Update offliner details).
# Review all your entries once again, then go back to the top of the page and click on (Request).
# After about an hour, check the recipe if it failed or succeeded (or the next day if the source website is large).
# If successful, go to this link ([dev.library.kiwix.org](https://dev.library.kiwix.org/)) and check your created file, check the size and check if the file is working properly. If the file does not appear, wait a bit as updates are made every 15 minutes.
# If the file looks good and complete, go back to your recipe, In warehouse path space, change(/.hidden/.dev) to the proper category related to your file content (Wikipedia, Wikihow, … etc).
# Click on Update offliner details and then click on Request again.
# Finally, check the file in [https://library.kiwix.org/ Kiwix Content Library]. If all is good, do not forget to go back to [https://github.com/openzim/zim-requests/issues the initial ticket] and put the link of the output file and close the ticket.
 
 
==== Change a recipe/ZIM warehouse path and/or a ZIM name ====
Changing the warehouse path of a recipe, once a first ZIM has already been produced, is not a negligible action. It has impact on the library and on the [https://imager.kiwix.org Kiwix Hotspot Imager]. Therefore, actions must be coordinated.
 
It is hence mandatory that whenever a recipe needs to change its warehouse path, [https://github.com/openzim/zim-requests openzim/zim_requests a ticket has to be open at GitHub] and assigned to both @RavanJAltaie, @benoit74 and @rgaudin for proper coordination:
 
# Disable the recipe in Zimfarm (''a priori'' @RavanJAltaie)
# Wait until there are no more in-progress Orders in the Kiwix Hotspot Imager that include those ZIMs (''a priori'' @rgaudin)
#Put Kiwix Hotspot Imager in maintenance (''a priori'' @rgaudin)
# Move existing ZIMs on the file server (''a priori'' @benoit74)
# Trigger catalog refresh right after so any Imager Order / download created right after uses the new URL (''a priori''  @rgaudin)
#Remove Kiwix Hotspot Imager from maintenance (''a priori'' @rgaudin)
# Update the warehouse path / ZIM name in Zimfarm (''a priori''  @RavanJAltaie)
# Re-enable the recipe in Zimfarm (''a priori''  @RavanJAltaie)
 
==== Abnormal scrapes duration ====
At least twice a week, analyze ongoing tasks at and report any task with an abnormal duration.
 
An abnormal duration is a task which takes longer than 7 days to complete AND which usual duration is either unknown (new tasks) or significantly lower than current duration (few days at least).
 
For each task with an abnormal duration, an issue must be created in ''[https://github.com/openzim/zim-requests/issues openzim/zim-request]'' with the [https://github.com/openzim/zim-requests/issues/new/choose "Zimfarm Task Duration Issue" template].
 
The issue must be assigned to the task requestor if the task has been manually requested in the Zimfarm, @RavanJAltaie otherwise.
 
==== Failed scrapes ====
Whenever possible, and at least twice a week, it is necessary to report failed tasks in GitHub issues.
 
If the task failure is linked to an new recipe currently being fine-tuned, the failure must be reported in the corresponding existing issue in [https://github.com/openzim/zim-requests/issues openzim/zim-request].
 
If the task failure is linked to a recipe which already has a "Zimfarm Recipe Issue" issue opened in [https://github.com/openzim/zim-requests/issues openzim/zim-request], then a new comment must be added to the issue.
 
Otherwise, a new issue must be created in `openzim/zim-request` at GitHub with the [https://github.com/openzim/zim-requests/issues/new/choose "Zimfarm Recipe Failure Issue" template]. The issue must first be assigned to @RavanJAltaie for first diagnosis.
 
In both cases, the "Bug" label must be placed on the issue.
 
==== Diagnose "Zimfarm Recipe Failure Issue" issues ====
You may do the first diagnosis only if the issue is assigned to you. If the issue is assigned to someone else, please ask for permission first. This rule can be bypassed for obvious reasons is the person is on long leave, sick, ...
 
This diagnosis is expected to be done within few days, less than a week at most.
 
To diagnose "Zimfarm Recipe Issue", following criteria have to be analyzed:
# Is this the first failure of the recipe in a row?
# Do we have a previous task that worked well?
# Do we miss an obvious error message in the scraper log that indicates the recipe is doomed to fail if ran again?
 
If the answer is "yes" to all three questions, then the recipe must be requested again, this might have been a temporary failure.
 
Otherwise, either the recipe parameters have to be adjusted if the fix is obvious (e.g. "Title is too long error", ...) and the recipe requested again, or the issue must be raised to @benoit74 for analysis.
 
==== Diagnose "Zimfarm Task Duration Issue" issues ====
You may do the first diagnosis only if the issue is assigned to you. If the issue is assigned to someone else, please ask for permission first. This rule can be bypassed for obvious reason is the person is on long leave, sick, ...
 
This diagnosis is expected to be done within few hours, less than few days at most.
 
To diagnose "Zimfarm Task Duration Issue", following criteria have to be analyzed:
# Do we still have signs of activity in scraper log (e.g. a log from less than 1 day ago) ?
# For scraper reporting progress, are the progress number relevant to have completion within 30 days ?
# Is the task running for less than 30 days ?
 
If the answer is "yes" to all three question, then you should let the task continue and reassess within few days.
 
If the answer is "no" to any of these questions, then the issue must be raised to @benoit74 for analysis.
 
=== Library Management ===
==== Deleting a ZIM ====
Deleting a ZIM which has already been published is not a negligible action. It has impact on the library and on the [https://imager.kiwix.org Kiwix Hotspot Imager], where actions must be coordinated.
 
It is hence mandatory that, whenever a recipe/ZIM needs to be deleted, [https://github.com/openzim/zim-requests openzim/zim_requests a ticket is opened on GitHub] and assigned to both @benoit74 and @rgaudin for proper coordination:
# Add a delete marker on storage (if <code>zim/zimit/my_zim.zim</code> needs to be removed from catalog, you have to "touch" <code>zim/zimit/my_zim.delete</code>)
#Wait for library catalog to be regenerated
#Check that there are no more in-progress Orders in the Kiwix Hotspot Imager that include those ZIMs
# Delete ZIM (and delete marker) from the file server
 
''Nota'': Moving a file to the archive has to be considered as a file deletion.
 
==== Demo a ZIM ====
From time to time, we need to demo a ZIM to a customer before releasing it into the wild. We have a demo instance at https://demo.library.kiwix.org/
 
Configuration is done through the file at https://github.com/kiwix/operations/blob/main/zim/demo-library/demos.yaml ; should you need to create a new demo, modify or delete an existing one, simply open a PR with your modifications on this file and ask @rgaudin or @benoit74 for review.
 
Every ZIM can be referenced either by full path or by path up-to-the-date, I which case most recent one will be automatically selected at each configuration redeployment.
 
Once merged, this configuration is automatically redeployed every hour, so once your PR is merged give it a bit of time to be deployed.
 
After that, send the demo URL to our client, e.g. <nowiki>https://demo.library.kiwix.org/</nowiki>'''home/'''my_demo/ ; this URL will be updated every time you modify the configuration or ZIMs gets updated.
 
It is now '''forbidden''' to send a link on https://dev.library.kiwix.org to a customer, this is not an infrastructure meant to be highly available and can be shutdown at any time without notice.
 
''Nota:''
 
- all demos must have an expired_on property, and they are automatically removed at this date
 
- this infrastructure can serve any ZIM available on our storage (public and hidden ones)


- adding or removing a demo or ZIMs does not make any modification to the ZIMs stored in our storage


== Members ==
== Members ==
* [https://github.com/Popolechien Popolechien], manager in line
* [https://github.com/Popolechien Popolechien], manager in line
* [https://github.com/RavanJAltaie Ravan], content manager
* [https://github.com/benoit74 Benoit74], scrapers lead dev
* [https://github.com/benoit74 Benoit74], scrapers lead dev


== See also ==
== See also ==
* [[Content strategy]]
* [[Content strategy]]

Latest revision as of 12:16, 9 December 2024

The Content team gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives).

Purpose

Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market-ready; the content must be up to date and as portable as technically possible.

Goals

  • Book curation must remain focused on educational material, broadly construed;
  • Books should have proper visual formatting;
  • Books should be up-to-date like custom apps;
  • The Kiwix Library should allow easy and friendly discovery of content.

Priority should always be given to maintaining existing content over creating new content.

Responsibilities

  • Content Requests
    • Collaborate with requesters to qualify requests properly. Keep them informed.
    • Ensure we are allowed and able to fulfill requests
    • Initiate new recipes and manage the first publication of new books
    • Collaborate with scraper dev. team if necessary
    • Keep the tickets up to date
  • Scraping
    • Ensure Zimfarm works fine and contribute to its improvements with dev. team
    • Analyze failures or unexpected behaviors
    • Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team
    • Ensure workers are online and are properly configured
    • Ensure the scrape lifecycle is correct (reasonable pipeline size, running scrapes progressing appropriately, not too many failures)
  • Library management
    • Ensure ZIM filenames and location (paths) are correct
    • Ensure ZIM Metadata are correct
    • Ensure ZIMs are recent and kept up to date (as far as possible)
    • Ensure library is coherent and user-friendly

Policies

Publishing

  • Content has to be legal in Switzerland
  • Content should not advertise fringe theories
  • Content should preferably be free content
  • If not free, content should be:
    • Open content OR
    • Educational content OR
    • Content with an authorization of reproduction
  • Any content we publish should
    • have (almost) no user-visible errors
    • have proper/correct metadata
    • be easily discoverable in the public library

Content Requests

  • Allow everybody to request new content, or changes to or deletion of existing content
  • Track the lifecycle of our content portfolio in full transparency
  • New content should be assessed and vetted against the publishing policy (see above)
  • Content requests should be closed:
    • when fully implemented (user visible)
    • when refused or impossible to implement
  • ZIM Metadata should be given for new content
  • Start scraping only once all prerequisites are satisfied

Scraping

  • Scraping leadership means the initiative should come from the content team
  • Zimfarm activity should be monitored periodically
  • First analysis of error should be done by content team
  • If error in scraper is suspected
    • The issue should be reported in the corresponding scraper code repository
    • Scraper problem analysis does not supersede the content request in any manner
  • ZIM quality should be vetted against publishing policy
  • Any recipe should run successfully in dev first before being put in production
  • Hardware resources should be saved

Library Management

Custom Apps

Processes

Content Requests

Scraping

Zimfarm monitoring

Zimfarm activity should be monitored periodically (daily or weekly).

The following should be systematically addressed:

  • Failed recipes
  • Recipes taking longer than usual
  • Worker disappearance

Library Management

Custom Apps

Workflows

Scraping

Create a Youtube recipe

This workflow creates a new recipe to scrape videos from a YouTube channel/username or from one or more playlists.

It’s recommended to clone an existing Youtube recipe.

  • In "Content settings":
  1. Name the recipe as per the naming conventions.
  2. In the Language field, choose the language(s) of the YouTube page you are creating the recipe for.
  3. In the Category field, choose (other).
  4. In the warehouse path field, always choose "/.hidden/.dev" for the first run, in order to test the resulting ZIM file.
  5. Once the file has been tested and everything is correct, update the recipe with the proper path "videos". Otherwise, tune the recipe and relaunch a task.
  6. Make sure the Status is set to Enabled.
  7. Set Periodicity to monthly or quarterly. Use monthly by default.
  • In "Task settings":
  1. In the Offliner field, choose Youtube.
  2. In the Platform field, choose Youtube.
  3. Leave the rest unchanged.
  • In "Scraper settings: youtube command flags":
  1. In Playlist mode: choose (Not Set) if the recipe is for a whole channel.
  2. If the recipe is for a playlist, choose (Set).
  3. In Type: choose (Channel) or (Playlist), depending on the file you need.
  4. In Youtube ID: type the ID of the channel or the playlist.
  5. For the API Key: there is a list of keys, allocated mostly according to channel or playlist size; ask for the list and choose the appropriate API key.
  6. In ZIM Name: enter the recipe name as per the naming conventions (https://github.com/openzim/overview/wiki/Naming-Convention).
  7. In Title: type the name you want for the output file.
  8. In Description: type a short description of the ZIM file.
  9. Leave Optimisation Cache URL as it is (cloned from the old recipe).
  10. Leave the rest of the fields empty or as in the cloned recipe.
  11. Finally, click (Update offliner details) at the bottom.
  12. Review all your entries once again, then go back to the top of the page and click (Request).
  13. After about an hour, check whether the recipe failed or succeeded (or check the next day if the source website is large).
  14. If it succeeded, go to https://dev.library.kiwix.org/ and check the created file: check its size and check that it works properly. If the file does not appear, wait a bit, as updates are made every 15 minutes.
  15. If the file looks good and complete, go back to your recipe and, in the warehouse path field, change (/.hidden/.dev) to the proper category for your file content (Wikipedia, Wikihow, etc.).
  16. Click (Update offliner details), then click (Request) again.
  17. Finally, check the file in the Kiwix Content Library. If all is good, do not forget to go back to the initial ticket, post the link to the output file and close the ticket.
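
The first-run rules above can be expressed as a small pre-flight check. This is only a sketch: the dictionary keys (`warehouse_path`, `status`, `periodicity`) and their values are hypothetical names chosen for illustration, not the Zimfarm API or its recipe schema.

```python
# Hypothetical recipe representation; the keys are illustrative, not Zimfarm's schema.
DEV_WAREHOUSE_PATH = "/.hidden/.dev"
ALLOWED_PERIODICITIES = {"monthly", "quarterly"}


def first_run_problems(recipe: dict) -> list[str]:
    """Return the reasons why a recipe is not ready for its first test run."""
    problems = []
    # A new recipe must publish to the dev path until its ZIM has been checked.
    if recipe.get("warehouse_path") != DEV_WAREHOUSE_PATH:
        problems.append(f"warehouse path must be {DEV_WAREHOUSE_PATH} for the first run")
    if recipe.get("status") != "enabled":
        problems.append("status must be set to Enabled")
    if recipe.get("periodicity") not in ALLOWED_PERIODICITIES:
        problems.append("periodicity must be monthly or quarterly")
    return problems
```

An empty list means the three checked settings match the workflow; it says nothing about the scraper flags, which still need a manual review.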


Change a recipe/ZIM warehouse path and/or a ZIM name

Changing the warehouse path of a recipe once a first ZIM has already been produced is not a trivial action. It has an impact on the library and on the Kiwix Hotspot Imager. Therefore, actions must be coordinated.

It is hence mandatory that, whenever a recipe needs to change its warehouse path, a ticket is opened in openzim/zim-requests on GitHub and assigned to @RavanJAltaie, @benoit74 and @rgaudin for proper coordination:

  1. Disable the recipe in Zimfarm (a priori @RavanJAltaie)
  2. Wait until there are no more in-progress Orders in the Kiwix Hotspot Imager that include those ZIMs (a priori @rgaudin)
  3. Put Kiwix Hotspot Imager in maintenance (a priori @rgaudin)
  4. Move existing ZIMs on the file server (a priori @benoit74)
  5. Trigger catalog refresh right after so any Imager Order / download created right after uses the new URL (a priori @rgaudin)
  6. Remove Kiwix Hotspot Imager from maintenance (a priori @rgaudin)
  7. Update the warehouse path / ZIM name in Zimfarm (a priori @RavanJAltaie)
  8. Re-enable the recipe in Zimfarm (a priori @RavanJAltaie)

Abnormal scrapes duration

At least twice a week, analyze ongoing tasks and report any task with an abnormal duration.

A task has an abnormal duration when it has been running for longer than 7 days AND its usual duration is either unknown (new tasks) or significantly lower than the current duration (by a few days at least).
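
The rule can be sketched as a small predicate. The 7-day threshold comes from the definition above; the 3-day value used for "a few days at least" is an assumption made for illustration and should be adjusted to the team's practice.

```python
from datetime import timedelta
from typing import Optional

ABNORMAL_THRESHOLD = timedelta(days=7)
# Assumption: "a few days at least" is read here as 3 days or more.
SIGNIFICANT_GAP = timedelta(days=3)


def is_abnormal_duration(current: timedelta, usual: Optional[timedelta]) -> bool:
    """Return True when a task's duration matches the definition of abnormal."""
    if current <= ABNORMAL_THRESHOLD:
        return False
    # New tasks have no usual duration: anything above the threshold is abnormal.
    if usual is None:
        return True
    return current - usual >= SIGNIFICANT_GAP
```

For instance, a task running for 10 days that usually takes 2 days is flagged, while a task running for 8 days that usually takes 7 is not.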

For each task with an abnormal duration, an issue must be created in openzim/zim-requests with the "Zimfarm Task Duration Issue" template.

The issue must be assigned to the task requestor if the task has been manually requested in the Zimfarm, @RavanJAltaie otherwise.

Failed scrapes

Whenever possible, and at least twice a week, it is necessary to report failed tasks in GitHub issues.

If the task failure is linked to a new recipe currently being fine-tuned, the failure must be reported in the corresponding existing issue in openzim/zim-requests.

If the task failure is linked to a recipe which already has a "Zimfarm Recipe Failure Issue" issue opened in openzim/zim-requests, then a new comment must be added to the issue.

Otherwise, a new issue must be created in openzim/zim-requests on GitHub with the "Zimfarm Recipe Failure Issue" template. The issue must initially be assigned to @RavanJAltaie for a first diagnosis.

In both cases, the "Bug" label must be placed on the issue.
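
The reporting rules above amount to a three-way decision, sketched below. The function and enum names are illustrative only; nothing here calls GitHub or the Zimfarm.

```python
from enum import Enum


class ReportAction(Enum):
    COMMENT_ON_REQUEST_ISSUE = "comment on the existing content request issue"
    COMMENT_ON_FAILURE_ISSUE = "comment on the open recipe failure issue"
    OPEN_FAILURE_ISSUE = "open a new recipe failure issue"


def choose_report_action(recipe_being_tuned: bool, has_open_failure_issue: bool) -> ReportAction:
    """Pick where a failed task must be reported, in the order given above."""
    if recipe_being_tuned:
        return ReportAction.COMMENT_ON_REQUEST_ISSUE
    if has_open_failure_issue:
        return ReportAction.COMMENT_ON_FAILURE_ISSUE
    return ReportAction.OPEN_FAILURE_ISSUE
```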

Diagnose "Zimfarm Recipe Failure Issue" issues

You may do the first diagnosis only if the issue is assigned to you. If the issue is assigned to someone else, please ask for permission first. This rule can be bypassed for obvious reasons, e.g. if the person is on long leave or sick.

This diagnosis is expected to be done within a few days, and within a week at most.

To diagnose a "Zimfarm Recipe Failure Issue", the following criteria have to be analyzed:

  1. Is this the first failure of the recipe in a row?
  2. Do we have a previous task that worked well?
  3. Is the scraper log free of any obvious error message indicating that the recipe is doomed to fail if run again?

If the answer is "yes" to all three questions, then the recipe must be requested again, as this might have been a temporary failure.

Otherwise, either the recipe parameters have to be adjusted if the fix is obvious (e.g. a "Title is too long" error) and the recipe requested again, or the issue must be raised to @benoit74 for analysis.
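
As a sketch, the diagnosis reduces to the decision function below. The parameter names and the returned strings are illustrative; whether a fix counts as "obvious" remains a human judgment passed in as a flag.

```python
def recipe_failure_next_step(
    first_failure_in_a_row: bool,
    has_previous_success: bool,
    log_free_of_fatal_error: bool,
    fix_is_obvious: bool = False,
) -> str:
    """Return the next action for a recipe failure issue."""
    # Three "yes" answers: probably a temporary failure, so simply retry.
    if first_failure_in_a_row and has_previous_success and log_free_of_fatal_error:
        return "request again"
    if fix_is_obvious:
        return "adjust parameters and request again"
    return "raise to @benoit74"
```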

Diagnose "Zimfarm Task Duration Issue" issues

You may do the first diagnosis only if the issue is assigned to you. If the issue is assigned to someone else, please ask for permission first. This rule can be bypassed for obvious reasons, e.g. if the person is on long leave or sick.

This diagnosis is expected to be done within a few hours, and within a few days at most.

To diagnose a "Zimfarm Task Duration Issue", the following criteria have to be analyzed:

  1. Do we still have signs of activity in the scraper log (e.g. a log entry from less than 1 day ago)?
  2. For scrapers reporting progress, are the progress numbers consistent with completion within 30 days?
  3. Has the task been running for less than 30 days?

If the answer is "yes" to all three questions, then you should let the task continue and reassess within a few days.

If the answer is "no" to any of these questions, then the issue must be raised to @benoit74 for analysis.
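
The three questions can be sketched as a single check. The parameter names are illustrative, and one assumption is added: a scraper that reports no progress (`projected_total=None`) is treated as passing the second question, since that question only applies to scrapers reporting progress.

```python
from datetime import timedelta
from typing import Optional

MAX_LOG_SILENCE = timedelta(days=1)
MAX_RUNTIME = timedelta(days=30)


def should_let_task_continue(
    last_log_age: timedelta,
    running_for: timedelta,
    projected_total: Optional[timedelta] = None,
) -> bool:
    """Return True when all three answers are "yes", False when the issue must be raised."""
    log_is_active = last_log_age < MAX_LOG_SILENCE
    # Assumption: no progress reporting means question 2 does not block the task.
    on_track = projected_total is None or projected_total <= MAX_RUNTIME
    within_limit = running_for < MAX_RUNTIME
    return log_is_active and on_track and within_limit
```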

Library Management

Deleting a ZIM

Deleting a ZIM which has already been published is not a trivial action. It has an impact on the library and on the Kiwix Hotspot Imager, so actions must be coordinated.

It is hence mandatory that, whenever a recipe/ZIM needs to be deleted, a ticket is opened in openzim/zim-requests on GitHub and assigned to both @benoit74 and @rgaudin for proper coordination:

  1. Add a delete marker on storage (if zim/zimit/my_zim.zim needs to be removed from catalog, you have to "touch" zim/zimit/my_zim.delete)
  2. Wait for library catalog to be regenerated
  3. Check that there are no more in-progress Orders in the Kiwix Hotspot Imager that include those ZIMs
  4. Delete ZIM (and delete marker) from the file server

Nota: Moving a file to the archive has to be considered as a file deletion.
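
The delete marker from step 1 is an empty file with the same name as the ZIM and a `.delete` extension. A minimal sketch, assuming the storage is reachable as a local directory (`storage_root` is a hypothetical mount point, not a documented path):

```python
from pathlib import Path


def delete_marker_for(zim_path: Path) -> Path:
    """Return the marker path for a ZIM, e.g. my_zim.zim -> my_zim.delete."""
    return zim_path.with_suffix(".delete")


def add_delete_marker(storage_root: Path, zim_relpath: str) -> Path:
    """Create the empty marker file next to the ZIM, like the "touch" command."""
    marker = delete_marker_for(storage_root / zim_relpath)
    marker.touch()
    return marker
```

`with_suffix` only replaces the final `.zim` extension, so dates or dots earlier in the filename are left untouched.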

Demo a ZIM

From time to time, we need to demo a ZIM to a customer before releasing it into the wild. We have a demo instance at https://demo.library.kiwix.org/

Configuration is done through the file at https://github.com/kiwix/operations/blob/main/zim/demo-library/demos.yaml. Should you need to create a new demo, or modify or delete an existing one, simply open a PR with your modifications to this file and ask @rgaudin or @benoit74 for review.

Every ZIM can be referenced either by its full path or by its path up to the date, in which case the most recent one is automatically selected at each configuration redeployment.

The configuration is automatically redeployed every hour, so once your PR is merged, give it a bit of time to be deployed.

After that, send the demo URL to our client, e.g. https://demo.library.kiwix.org/home/my_demo/; this URL is updated every time you modify the configuration or a ZIM gets updated.

It is now forbidden to send a link to https://dev.library.kiwix.org to a customer: this infrastructure is not meant to be highly available and can be shut down at any time without notice.

Nota:

  • All demos must have an expired_on property; they are automatically removed on that date.
  • This infrastructure can serve any ZIM available on our storage (public and hidden ones).
  • Adding or removing a demo or ZIMs does not modify the ZIMs stored in our storage.
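
The expiry rule and the URL pattern can be illustrated as follows. This is a sketch under stated assumptions: the mapping of demo name to expiry date is a hypothetical in-memory structure, not the actual layout of demos.yaml, and a demo is assumed to be gone on its expired_on date itself.

```python
from datetime import date

DEMO_BASE_URL = "https://demo.library.kiwix.org/home/"


def demo_url(name: str) -> str:
    """Build the URL to send to a client, following the example above."""
    return f"{DEMO_BASE_URL}{name}/"


def active_demos(demos: dict[str, date], today: date) -> list[str]:
    """Return the names of demos that have not reached their expired_on date."""
    # Assumption: a demo is removed on its expired_on date, not the day after.
    return sorted(name for name, expired_on in demos.items() if today < expired_on)
```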

Members

  • Popolechien, manager in line
  • Ravan, content manager
  • Benoit74, scrapers lead dev

See also

  • Content strategy