Zimit

Zimit is a tool for creating a ZIM file of "any" website.

Context

openZIM provides many scrapers dedicated to specific sources of content such as TED, Wikipedia (MediaWiki), Project Gutenberg, ... This is a great way to provide quality ZIM files, but developing and maintaining each of them is costly.

Zimit is our approach to scraping a "random" website and getting an acceptable snapshot to be used offline.

One important point is that embedded pieces of JavaScript code, in particular those used to play videos, continue to work.

Principle

The principles of Zimit are:

  • Crawl the remote website to retrieve all the necessary content
  • Save all the retrieved content in WARC file(s)
  • Convert the WARC file(s) to one ZIM file (this implies embedding a reader in the ZIM file, so this is a kind of offline Web App)
  • Read the ZIM file in any Kiwix reader

Player

  • The service worker (SW) is installed from the welcome page. If a page is loaded while the SW is not yet loaded, a redirection to the homepage happens to load the SW and then automatically come back to the original page. To achieve this, each page's HEAD node is modified to insert the appropriate piece of JavaScript at the time of the WARC-to-ZIM conversion.
  • In the reader, Wabac.js, only one part is specific to the ZIM content structure, and it is in "RemoteWARCProxy". This part knows how to retrieve content from the specific ZIM storage backend. The rest of the code is the same as before.
  • Regarding URL rewriting itself, we have two kinds:
    • The static URL rewriting which is done with Wombat (mostly code-driven)
    • The Fuzzy matching which is done within the ServiceWorker (mostly data-driven)
  • The URL rewriting is done at two levels:
    • When the JavaScript code calls specific browser APIs, these calls are overridden and ultimately call Wombat
    • When a URL is called, then it goes through the service-worker which does the fuzzy-matching and the URL rewriting.
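To illustrate what "data-driven" fuzzy matching could mean, here is a minimal Python sketch in which the rules are plain data (a list of pattern/replacement pairs) and the code only applies them. The two rules and the name `fuzzy_urls` are assumptions made up for this example; they are not the rules shipped with wabac.js.

```python
import re

# Hypothetical rules, for illustration only: each rule is (pattern, replacement).
FUZZY_RULES = [
    (re.compile(r"[?&]_=\d+$"), ""),  # drop a trailing cache-busting parameter
    (re.compile(r"\?.*$"), ""),       # drop the whole query string
]


def fuzzy_urls(url):
    """Yield candidate URLs produced by each rule that changes the URL."""
    for pattern, replacement in FUZZY_RULES:
        candidate = pattern.sub(replacement, url)
        if candidate != url:
            yield candidate


print(list(fuzzy_urls("example.com/api?v=2&_=1234")))
```

Adding or changing a rule then only means editing the data, not the matching code.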

Source code

Current implementation workflow (to be confirmed)

At creation time

  • Browsertrix somehow creates a WARC file.
  • warc2zim converts the WARC file into a ZIM file. To do so, it does the following:
    • Loop over all records in the WARC file.
    • For each record:
      • Extract the URL: "urlkey" if present, else "WARC-Target-URI".
      • Add an `H/<url>` entry containing the headers of the record.
      • Add an `A/<url>` entry containing the content (payload) of the record (if the record is not a revisit). If the content is HTML, also insert a small JS script which redirects to index.html if the SW is not loaded.
    • Add the wabac.js replayer (which also "contains" wombat).
    • Add a "front page" (index.html) which loads the wabac SW when opened.
    • Add a "top frame" page with an iframe and a small script (mainly in charge of syncing history and icons).
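The per-record loop above can be sketched in Python as follows. This is not the warc2zim code: `Record` is a hypothetical stand-in for a parsed WARC record, the ZIM file is modelled as a plain dict, and `REDIRECT_SNIPPET` is a placeholder for the real injected script.

```python
from dataclasses import dataclass

# Placeholder for the injected redirect script (assumption, not the real one).
REDIRECT_SNIPPET = "<script>/* go to index.html if the SW is not loaded */</script>"


@dataclass
class Record:
    """Hypothetical, simplified view of a WARC record."""
    headers: dict
    payload: bytes = b""
    content_type: str = ""
    is_revisit: bool = False


def record_url(record: Record) -> str:
    """Use "urlkey" if present, else "WARC-Target-URI"."""
    return record.headers.get("urlkey") or record.headers["WARC-Target-URI"]


def convert(records: list) -> dict:
    """Return a dict modelling ZIM entries keyed by H/<url> and A/<url>."""
    entries = {}
    for record in records:
        url = record_url(record)
        # H/<url>: the record headers, serialized here as "key: value" lines.
        header_text = "\r\n".join(f"{k}: {v}" for k, v in record.headers.items())
        entries[f"H/{url}"] = header_text.encode("utf-8")
        if record.is_revisit:
            continue  # a revisit gets no A/ entry of its own
        payload = record.payload
        if record.content_type.startswith("text/html"):
            # Insert the redirect script right after the first <head> tag.
            payload = payload.replace(
                b"<head>", b"<head>" + REDIRECT_SNIPPET.encode("utf-8"), 1
            )
        entries[f"A/{url}"] = payload
    return entries
```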

At reading

  • The user goes to a page. If the SW is not loaded, the inserted script redirects to index.html, which loads the SW, registers itself as a new collection (using the "top frame" as top page) and redirects to the requested page once the collection is added.
  • The SW handles the URL. It does the following:
    • Find the right collection (based on the book name)
    • Call coll.handleRequest, which:
      • calls `getReplayResponse`, which:
        • calls store.getResource(), which:
          • Requests H/url and, if not found, generates "fuzzy URLs" and requests H/fuzzyurl for each of them. It stops as soon as it finds an H/(fuzzy)url. If it finds no header, it returns null.
          • If the header is a revisit, resolves it (by doing another request to H/target_url)
          • At the end, gets the payload by requesting A/final_url
          • Builds an ArchiveResponse with the header and payload
      • inserts the JS script loading wombat into the HTML content.
      • rewrites the ArchiveResponse content.
      • merges headers from the ArchiveResponse into the SW response (range, date, set-cookies, ...)
      • returns the response to the requester
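The lookup performed by store.getResource() can be sketched as below. The real code is JavaScript inside wabac.js; this Python version only mirrors the steps listed above. `fetch` and `fuzzy_urls` are hypothetical callables, and storing headers as JSON with a "revisit-target" key is a convention invented for this example.

```python
import json
from typing import Callable, Iterable, Optional


def get_resource(
    url: str,
    fetch: Callable[[str], Optional[bytes]],
    fuzzy_urls: Callable[[str], Iterable[str]],
) -> Optional[tuple]:
    """Return (headers, payload) for url, or None if no header is found."""
    final_url, raw = url, fetch(f"H/{url}")
    if raw is None:
        for candidate in fuzzy_urls(url):
            raw = fetch(f"H/{candidate}")
            if raw is not None:
                final_url = candidate
                break  # stop at the first header found
    if raw is None:
        return None
    headers = json.loads(raw)
    target = headers.get("revisit-target")  # assumed marker for a revisit
    if target is not None:
        # Resolve the revisit with another H/ request on the target URL.
        target_raw = fetch(f"H/{target}")
        if target_raw is None:
            return None
        headers, final_url = json.loads(target_raw), target
    payload = fetch(f"A/{final_url}")
    if payload is None:
        return None
    return headers, payload
```

With a dict as a fake store, `get_resource("a.com/x", store.get, lambda u: [])` performs the exact lookup only, and passing a real candidate generator enables the fuzzy step.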

   

Wombat is loaded in all pages as a web worker. JS code is wrapped in a wombat context which rewrites outgoing URLs (fetch, location changes, ...) before the request itself is made.

Comparison with pywb.

The workflow of pywb (with a WARC archive) is almost the same, with a small simplification: the rewriting and the fuzzy matching are done by the server itself, without a service worker.

  • The user goes to a specific URL (helped by the frontend UI).
  • pywb gets the URL and searches for the record (potentially with fuzzy matching).
  • Once it has the record, it rewrites the payload and returns a response (merging the record's headers into the response).

Rewriting the payload is the same as what is done in the SW (replace HTML/CSS links and insert the wombat loader).

At the end, all links are relative (or point to the server).

Rewriting urls

See documentation at https://pywb.readthedocs.io/en/latest/manual/rewriter.html

All(?) rewriting has the following form:

abs_url -> <server_host>/<collection>/<timestamp><modifier>/abs_url.

  • <collection> is the name of the "set of records" (one WARC? several?). In our case, it is the book name.
  • <timestamp> is necessary as a collection may contain records from different scrapes. In our case we have one scrape per book (and so per collection).
  • <modifier> is how we should rewrite the content:
    • id_ is no modification (identical)
    • mp_ is main page. As the modification is based on the content type, `mp_` can be applied to all types of content.
    • js_ and cs_ force a modification as JS or CSS even if the content type is something else (HTML).
    • im_, oe_, if_, fr_ are historical modifiers, same as mp_
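As a small illustration of the pattern above, the following Python sketch builds a rewritten URL from its parts. The function name `rewrite_url`, the host, the book name and the timestamp are made up for the example; only the URL layout and the modifier names come from this page.

```python
# Modifiers listed above; descriptions are paraphrased from this page.
MODIFIERS = {
    "id_": "identity, no modification",
    "mp_": "main page, rewrite according to content type",
    "js_": "force rewriting as JavaScript",
    "cs_": "force rewriting as CSS",
}


def rewrite_url(abs_url: str, server_host: str, collection: str,
                timestamp: str = "", modifier: str = "mp_") -> str:
    """Build <server_host>/<collection>/<timestamp><modifier>/abs_url."""
    if modifier not in MODIFIERS:
        raise ValueError(f"unknown modifier: {modifier}")
    return f"{server_host}/{collection}/{timestamp}{modifier}/{abs_url}"


print(rewrite_url("https://example.com/style.css", "http://localhost:8080",
                  "mybook", "20200101000000", "cs_"))
```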

Rewriting the content

CSS rewriter: rewrites links.

JS rewriter: rewrites a few links but mostly wraps the code in a "wombat context".

HTML rewriter: rewrites HTML and uses the CSS/JS rewriters as sub-rewriters for <style>/<script> tags.

JSONP rewriter: may rewrite the content based on the request's query string (!!!!!)
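The JSONP case is the one that depends on the request rather than on the record alone. The sketch below shows the general idea under the assumption that the rewriter swaps the callback name found at the start of the recorded body for the one given in the request's "callback" parameter. It is a simplified illustration written for this page, not a copy of pywb's rewriter.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches a leading "name(" where name is the recorded JSONP callback.
JSONP_CALL = re.compile(r"^\s*([\w.]+)\(")


def rewrite_jsonp(body: str, request_url: str) -> str:
    """Replace the recorded callback name with the requested one, if any."""
    callbacks = parse_qs(urlsplit(request_url).query).get("callback")
    match = JSONP_CALL.match(body)
    if not callbacks or not match:
        return body  # nothing to rewrite
    return body[:match.start(1)] + callbacks[0] + body[match.end(1):]
```

Because the callback name is only known when the request arrives, this step cannot be fully resolved at creation time.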

Proposed solution

At creation

Use the pywb rewriter module (https://github.com/webrecorder/pywb/tree/main/pywb/rewrite) to statically rewrite the content (record payload) at ZIM creation time.

A few things can be done statically:

  • <timestamp>: we could remove it (or we know it)
  • <modifier>: depends on the content type, which we know
  • <url>: is in the record's headers

A few things may not be possible to do statically:

  • <server_host>: depends on the production environment (host name, root prefix)
  • <collection>: depends on the ZIM filename (we may change this to rely on the zimid?)
  • <requested_url>: in the case of a "revisit", pywb and wabac return the content of another record. Is the content rewritten based on the requested URL or on the record URL? In the same way, in the case of fuzzy matching, the requested URL differs from the record URL.
  • JSONP needs access to the "callback" query string value of the request.

We could do the static rewriting by setting placeholders (${RW_SERVER_HOST}, ${RW_URL}, ...) for things that need to be rewritten dynamically.
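A minimal sketch of the dynamic half of this idea, in Python: the stored content carries placeholders and the server fills them in at serving time. The `RW_` prefix convention, the `RW_COLLECTION` name and the function `resolve_placeholders` are assumptions made for this example. Only names starting with `RW_` are touched, so other `${...}` sequences in the payload (for instance JS template literals) are left alone.

```python
import re

# Only ${RW_...} names are treated as placeholders (assumed convention).
PLACEHOLDER = re.compile(r"\$\{(RW_[A-Z_]+)\}")


def resolve_placeholders(content: str, values: dict) -> str:
    """Replace known placeholders; leave unknown ones untouched."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), content)


stored = '<a href="${RW_SERVER_HOST}/${RW_COLLECTION}/mp_/https://example.com/about">About</a>'
print(resolve_placeholders(stored, {
    "RW_SERVER_HOST": "http://localhost:8080",
    "RW_COLLECTION": "mybook",
}))
```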

Wombat initialization would be inserted into the HTML page at this step. Wombat itself will be used exactly the same way we use it now (catching URL changes/requests coming from JS and rewriting them to "local" URLs).

At reading

General workflow on kiwix-serve (WIP):

For a given requested /content/<book_name>/<url>

  1. Search for the ZIM file corresponding to <book_name>.
  2. Search for C/<url>
    1. If found, answer with the content of C/url (with dynamic rewrite). If H/url exists, set the HTTP response headers from H/url's headers.
    2. If not found, search for H/url as it may be a revisit
      1. If found, replace `url` by the revisit target and do 2.
  3. If 2. gives no answer:
    1. If a fuzzy rules definition is present in the ZIM file (W/fuzzy_rules ?), generate fuzzy URLs and do 2. with each of them
  4. If no fuzzy rule matches, answer 404
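The numbered workflow above can be sketched in Python as follows. kiwix-serve itself is not written in Python and this is not its code: `get_entry` and `fuzzy_urls` are hypothetical callables, an H/ entry is modelled as simply holding the revisit target URL, and `max_revisits` is an added guard against revisit loops that the workflow does not mention.

```python
from typing import Callable, Iterable, Optional


def lookup(
    url: str,
    get_entry: Callable[[str], Optional[bytes]],
    fuzzy_urls: Callable[[str], Iterable[str]],
    max_revisits: int = 5,
) -> Optional[tuple]:
    """Return (final_url, content), or None when the answer should be a 404."""

    def direct(candidate: str) -> Optional[tuple]:
        # Step 2: try C/<url> first, then H/<url> in case it is a revisit.
        for _ in range(max_revisits + 1):
            content = get_entry(f"C/{candidate}")
            if content is not None:
                return candidate, content
            target = get_entry(f"H/{candidate}")
            if target is None:
                return None
            candidate = target.decode("utf-8")  # follow the revisit target
        return None

    found = direct(url)
    if found is not None:
        return found
    # Step 3: retry step 2 with each fuzzy URL.
    for candidate in fuzzy_urls(url):
        found = direct(candidate)
        if found is not None:
            return found
    return None  # Step 4: 404
```

For a ZIM file without H entries or fuzzy rules, `get_entry` only ever succeeds on C/ paths and `fuzzy_urls` yields nothing, so the function reduces to a plain C/<url> lookup.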

 

This workflow should be compatible with existing zim files (no H nor W/fuzzy_rules).

Searching by C/url first allows us to avoid adding an H/url for the common case, even for warc2zim files.

This allows potential fuzzy matching for other ZIM files (specific scrapers).

Should be pretty "easy" to implement if we defined well:

Notes:

  • Revisits and redirects are different: a redirect makes kiwix-serve return a 302 to the target; a revisit makes kiwix-serve answer a 2xx with the content of the revisit target.
  • We may anyway store an H revisit as a redirect entry in the ZIM file.

Questions

Kelson

  • How well maintained is the Python server pywb? Who uses it?
  • Do we have other places, on top of "RemoteWARCProxy", where we have JavaScript code dedicated to Kiwix in Wabac/Wombat?
  • Is URL rewriting really data-driven? Same question for fuzzy matching?
  • Can we easily use Wombat without the rest of Wabac?

Matthieu

renaud

  • What makes the SW mandatory to replay? What is the constraint that requires it?
  • If not restricted to the sole browser (ie. kiwix-serve or any kiwix reader serving as a dynamic backend), what are the key information that are required for wombat? Just the serving URL? Is the timestamp important?
  • Fuzzy matching rules are found in wabac, wombat, pywb and warc2zim. Is this redundancy or are there multiple layers?
  • What's the extent of wombat's role? How far does it go and how required is it?
  • What are “prefix queries”? “prefix search”?
  • How does the replayer cache system work? What's its main purpose? Can it be turned off?
  • What's the difference between a page (as in pages.jsonl) and a `text/html` entry? Status code only?
  • Is there a WARC testing suite with various use and corner cases?