Thursday, October 31, 2013

using the Internet Archive bookreader offline

OK, this is a work in progress, but I wanted to take some notes on a little project. I love the Internet Archive (https://archive.org/), especially now that I have a Kindle and a Kindle Fire. I also really like their online bookreader, which is a visually pleasing JavaScript browser for original images of scanned book pages. It is especially nice for books in which layout, typesetting, images, etc. are important (since these are generally lost or mangled in OCR-based reflowable text files for eReaders). I recently read "The perception of the visual world" by James Gibson entirely online using it!

Great, but what if I want to be able to read when I am not connected to the Internet? (I don't know why I am so worried about having local copies of some things; it is some digital hoarding tendency.) It turns out that the bookreader is an open source project, entirely downloadable from GitHub! Some documentation is here.

Easy, I thought. I downloaded the zip file, unzipped it, and was off and running with the BookReaderDemo (double-click index.html and it should open in your browser). It took a few small tricks to get it to work with a new book downloaded from the archive. I decided to start with "On growth and form" by D'Arcy Wentworth Thompson. I downloaded all of the files (under All files: Torrent). I unzipped the "ongrowthform1917thom_flippy.zip" archive, which contains a bunch of .jpg files of the pages, and copied them all into a directory called "jpgImages" inside "BookReaderDemo". Then I changed BookReaderJSSimple.js in the following ways:

line 11 to     return 365;
line 16 to     return 600;
line 80 to br.numLeafs = 708;

(Just changing the width, height, and number of pages here, not entirely sure these are correct.)
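As a rough sketch, the three edits above might look like this in context. The plain object br here is only a stand-in so the snippet runs on its own; in the demo it would be the real BookReader instance, and I am assuming the functions on lines 11 and 16 are the demo's getPageWidth and getPageHeight. The numbers are the same guesses as above, not verified page dimensions.

```javascript
// Stand-in for the BookReader instance (assumption: the demo creates it as `br`).
var br = {};

// Width in pixels, returned for every page regardless of index (a guess).
br.getPageWidth = function (index) {
    return 365;
};

// Height in pixels, returned for every page regardless of index (a guess).
br.getPageHeight = function (index) {
    return 600;
};

// Total number of page images the reader should expect.
br.numLeafs = 708;
```

Since both functions ignore their index argument, every page is treated as the same size, which is probably fine for a scanned book with uniform pages.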

Finally the important part, telling it where to find the new images:
line 28 to         var url = 'jpgImages/'+leafStr.replace(re, imgStr) + '.jpg';

This worked, but the pages all looked low-quality and blurry. I poked around, and it looks like the Internet Archive actually uses JPEG 2000 (.jp2) for the real full-quality images. These are stored in the original download in the ongrowthform1917thom_jp2.zip file. I unzipped this, but the problem is that just opening the new directory crashes Nautilus! I guess jp2 hasn't exactly gotten widely accepted... I installed ImageMagick by running sudo apt-get install imagemagick

Now I could navigate to the directory containing the .jp2 files and convert them to .jpg with the command mogrify -format jpg *.jp2

Now if I change line 28 (the url line from before) to var url = 'jp2Images/ongrowthform1917thom_'+leafStr.replace(re, imgStr) + '.jpg';
(after copying the jp2 directory to BookReaderDemo and renaming it jp2Images) it looks like I get a functioning bookreader! I'm sure their implementation with jp2 files is faster and has other advantages, but I'm just happy to get something working relatively quickly.
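For reference, here is a small sketch of how the filename gets built in both versions of the url line. The definitions of leafStr, imgStr, and re are my assumption about what the surrounding demo code does (zero-pad the 1-based page number to three digits); localPageUrl and its dir and prefix parameters are hypothetical names made up for this example.

```javascript
// Hypothetical helper: build the local image path for a 0-based page index.
// Assumption: the demo pads the 1-based page number into a '000' template.
function localPageUrl(index, dir, prefix) {
    var leafStr = '000';
    var imgStr = (index + 1).toString();
    // Match as many trailing zeros as the page number has digits...
    var re = new RegExp('0{' + imgStr.length + '}$');
    // ...and replace them with the page number, e.g. '000' -> '042'.
    return dir + '/' + prefix + leafStr.replace(re, imgStr) + '.jpg';
}

console.log(localPageUrl(0, 'jpgImages', ''));
// prints "jpgImages/001.jpg"
console.log(localPageUrl(41, 'jp2Images', 'ongrowthform1917thom_'));
// prints "jp2Images/ongrowthform1917thom_042.jpg"
```

If the padding really works like this, it only handles page numbers up to 999, and the converted files have to be named to match whatever the function produces, so it is worth checking one or two filenames by hand.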

Finally, let's see how well it works to embed the book (still hosted on archive.org) here:


1 comment:

simplemind said...

love the idea but too complex. can you simplify?