Building a Fast Wikipedia Offline Reader 208

ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop to be able to carry it along with me on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword (actually, title words) based searching. (3) Search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built through Xapian. (6) Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL — which, if you want to enable keyword searching, takes days."
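As a rough sketch of the title-word search idea (illustrative only -- the submitter's real implementation uses Xapian, not this toy index), a keyword index over article titles that returns candidates ranked by how many query words they match could look like:

```python
from collections import defaultdict

def build_title_index(titles):
    """Map each lowercase title word to the set of titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.lower().split():
            index[word].add(title)
    return index

def search(index, query):
    """Return candidate titles sorted by how many query words they match."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for title in index.get(word, ()):
            scores[title] += 1
    return sorted(scores, key=lambda t: -scores[t])

titles = ["Battle of Hastings", "Battle of Britain", "P-51 Mustang"]
idx = build_title_index(titles)
print(search(idx, "battle hastings"))  # best match first
```

A real engine like Xapian adds stemming, phrase handling, and probabilistic weighting on top of this basic inverted-index idea.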
  • Wow! (Score:3, Funny)

    by ferrocene ( 203243 ) on Monday August 13, 2007 @10:56PM (#20220565) Journal
    After doing all that, I think you may have missed your flight! :)
  • Ho-Hum ... (Score:5, Funny)

    by jabberwock ( 10206 ) on Monday August 13, 2007 @11:21PM (#20220715) Homepage
    What, no auto update? No User Agreement? No disabled features that are enabled by a mammoth key? No product registration?

    Let us know when you're ready for prime time ... ;-)

    • Re:Ho-Hum ... (Score:5, Insightful)

      by OzRoy ( 602691 ) on Tuesday August 14, 2007 @01:18AM (#20221501)
      Auto-update would be interesting. How do you keep the data up to date without downloading the entire 2.9G again? Is there some sort of diff file you can download?
      • Re: (Score:3, Funny)

        How do you keep the data up to date without downloading the entire 2.9G again?

        Not too hard if you have a sub-etha net connection handy. Better check that the article about The Earth which you have been working on hasn't been cut down to two words though.

  • I hope (Score:4, Funny)

    by Nikron ( 888774 ) on Monday August 13, 2007 @11:36PM (#20220817)
    That you don't dump the wiki at a bad time.

    George W Bush

    Is a dick head!!!!11

    • Re:I hope (Score:5, Funny)

      by Anonymous Coward on Monday August 13, 2007 @11:58PM (#20220979)
      You mean before someone makes it inaccurate again?

      Oh, never mind, I see the problem:

      George W Bush

      Is a dick head!!!!11

      should be

      George W Bush

      Is a dick head!!!!!!

      Man, those out to mess with the content are getting more and more subtle...
    • by Ours ( 596171 )
      Looks fine by me.
  • But... (Score:2, Funny)

    by Anonymous Coward
    What's the point of it if there are no vandals or flame wars to make it interesting?
  • by Brietech ( 668850 ) on Monday August 13, 2007 @11:45PM (#20220879)
    Combine this and one of the new E-ink ebook readers, make it pretty rugged, slap a solar panel on the back and man... you have something really close to a genuine Hitchhiker's Guide to the Galaxy. Ah, I love where technology is heading =)
    • by Sneftel ( 15416 ) on Monday August 13, 2007 @11:55PM (#20220949)
      As long as hitchhikers primarily need to know how to evolve a Pikachu into a Raichu, and how Benjamin Disraeli has been referenced in pop culture.
    • by dch24 ( 904899 )
      Don't forget to put this on the cover, in large reassuring letters:

    • by Gromius ( 677157 ) on Tuesday August 14, 2007 @02:53AM (#20221905)
      Yes, it's a perfect fit. Particularly as Wikipedia has now supplanted the Encyclopedia Britannica in many places as the standard repository of all knowledge and wisdom. Although it has many omissions, and contains much that is apocryphal, or at least wildly inaccurate, it scores over the older, more pedestrian work in two important ways.

              1. It is slightly cheaper.
              2. It has the words "You can copy and edit me for free" inscribed in large friendly letters in the license.

      Also like the guide, although it cannot hope to be useful or informative on all matters, it does make the reassuring claim that where it is inaccurate, it is at least definitively inaccurate :)

  • by Anonymous Coward

    I was able to build this in two days, most of which were spent searching for the appropriate tools. Simply unbelievable... toying around with these tools and writing less than 200 lines of code, and... presto!
    Give that man a job at Google.
  • by phliar ( 87116 ) on Tuesday August 14, 2007 @12:27AM (#20221185) Homepage
    For a change it's not just a link to a .tar.gz somewhere, but an actual article where he goes through what he did, and (more important) why he did things that way. Good reading even if you don't want an off-line Wikipedia.
  • It doesn't take days (Score:5, Informative)

    by BReflection ( 736785 ) on Tuesday August 14, 2007 @01:10AM (#20221467) Homepage
    It only takes days if you use the PHP import script to import the SQL dump; that script was not designed for importing the entire dump.

    Use the ANSI C implementation, which takes about 20 minutes to convert the XML to SQL and then a few hours to import into MySQL. Please note that you need a properly configured MySQL server (with at least 8GB of RAM) in order to efficiently run a local copy of Wikipedia. []
    • By the way, he could have saved himself a lot of time if he had just purchased a WikiStick []
      • by jschrod ( 172610 )
        OK, I bite.

        Your URL leads to a domain parking page. Google search for Wikistick didn't bring results on the first page either. AFAIK, full Wikipedia (text and images) is too large for a USB stick.

        What did you want to tell us?

        • Is it? [] I read the DB is 2.9GB, which is all he's taken. You don't *need* the images, but I imagine they'd take a fair amount of space!
    • by Ant P. ( 974313 )

      Please note that you need a properly configured MySQL server in order to efficiently run a local copy of Wikipedia, which must have at least 8GB of RAM.
      In other words, your post is a complete waste of space given that he said in TFS that this is for a laptop.
  • by Splab ( 574204 ) on Tuesday August 14, 2007 @01:32AM (#20221557)
    is very, very slow when you do it on a default installation; the reason is that MySQL ships with a "be nice to people who don't know what they are doing" setup. Go into my.cnf, find the buffer settings, crank them up, and restart the server. It makes a real difference (especially if you are running InnoDB, which of course you are, since MyISAM isn't a proper database).
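A minimal illustration of the kind of tuning the comment means (the variable names are real MySQL settings, but the values here are only examples; tune them to your machine's RAM):

```ini
# Illustrative my.cnf fragment -- example values, not recommendations
[mysqld]
key_buffer_size         = 256M   # MyISAM index buffer
innodb_buffer_pool_size = 1G     # main InnoDB data/index cache
innodb_log_file_size    = 128M   # larger redo log helps bulk inserts
bulk_insert_buffer_size = 64M    # speeds up multi-row INSERTs
```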
  • What?? (Score:5, Funny)

    by icydog ( 923695 ) on Tuesday August 14, 2007 @01:48AM (#20221617) Homepage
    TFA is:

    1. Not a thinly-veiled attempt to advertise a crappy product
    2. Not bashing Microsoft
    3. Not about somebody who is trolling open-source (i.e. SCO)
    4. Not about Bush taking away all our rights and ending freedom
    5. Not about voting fraud and the end of democracy/America/the world
    6. Not decrying Vista DRM and its ties to the MAFIAA
    7. Posted on Slashdot

    Furthermore, TFA is interesting and informative.

    Am I in heaven?
  • I know that not everyone has a permanent connection to the net everywhere they go, but what is the point of storing a local copy of Wikipedia?

    The beauty of it is that it is online and always up-to-date (wrong, or less wrong).

    Trying to capture it locally seems to me to be like trying to print The Internet. By the time it's done spooling, it's out of date.

    If it's an academic project, that's really cool, but I don't see a practical point to it.

    • Re:The Point? (Score:5, Insightful)

      by Mr. Roadkill ( 731328 ) on Tuesday August 14, 2007 @03:26AM (#20222067)

      I know that not everyone has a permanent connection to the net everywhere they go, but what is the point of storing a local copy of Wikipedia?
      Ummm... I think the whole point is, as you've pointed out, that not everyone has a permanent connection to the net everywhere they go. Or maybe they don't have access to everything they'd like even where they *do* have net access, or don't want to pay extravagant data rates while out and about.

      Joe has all-you-can-eat broadband at home, or an understanding employer with a fat pipe, and spends two hours each day on the train. Two and a half gig per month (and let's face it, you probably don't want to update it more frequently than that) and he's got probably half his reading material sorted out.

      Wang lives in Buttfuckistan, a fictional country with totalitarian leanings with too many real-world counterparts. The Great Firewall of Buttfuckistan (i.e. squidguard, under the control of Buttfuckistan Telecom, and settings in the routers to drop non-port-80 traffic half the time) makes it impossible to reliably access Wikipedia from inside their borders, which is a great shame because the entry on Buttfuckistan is particularly unflattering. Once a month, Joe sticks a DVD with five minutes from an old re-run of Friends and an encrypted dump of Wikipedia in an airmail envelope and sends it to Wang.

      Mary is still at secondary school, and her particular school has wifi access for students who are encouraged to purchase their own laptops, but since the local pastor discovered shermans_wife_hokusai.jpg [] they've been forced to add wikipedia to the school's blocklist. Which is a pity, because it's a great first-approximation source for material or research directions, but there you go. Mary can make a local copy through her home broadband connection, and can access it locally on her laptop wherever she goes - even at school, or church. Bill, Jillian and Mungo (the pastor's son) find out about this, and now all four of them take it in turns to make the copy each month, sharing the bandwidth costs. Their friends Harry and Sally, who don't have broadband but are great friends of the other four, also get copies... and there are plans to distribute the copies further, as a kind of teenage grass-roots knowledge-sharing and social-justice effort.

      Still can't see the point?
    • by Riktov ( 632 )
      The beauty of it is that it is online and always up-to-date (wrong, or less wrong).

      Trying to capture it locally seems to me to be like trying to print The Internet. By the time it's done spooling, it's out of date.

      Sure, what's the point of reading an old version of the history of the Battle of Hastings, or the technical specifications of the P-51 Mustang, or the characteristics of a dominant seventh chord? After six months, it's completely obsolete and worthless, right?
  • How is this different from Moulin, which is a fully interactive, offline version of the entire Wikipedia (without pictures) on a CD-ROM? []
  • There's a one-button (for admins) export-the-whole-wiki-as-html feature in modern MediaWiki software installs...

    But hey, two days and a few hundred lines of code is cool. You geek (verb). If we always took the easy way out we'd be using Windows and have committed suicide long ago.
  • by mu22le ( 766735 ) on Tuesday August 14, 2007 @04:28AM (#20222305) Homepage Journal
    A PSP is very portable (fits in your sweater/backpack), hackable, and has up to 8GB of storage. I have been dreaming for a year about porting Wikipedia to it. Unfortunately I'm not familiar with the kind of programming needed, and I could never find the time...
  • by dannycim ( 442761 ) on Tuesday August 14, 2007 @05:25AM (#20222493)
    There's a serious problem with the article's way of treating the data that I didn't see addressed.

    The wikipedia database file is one large bzip2'ed XML file which the author splits into blocks of 900k (bzip2's natural blocking) which he then parses for the "title" and "text" XML tags.

    The problem with that approach is that some of these tags may well end up being split over block boundaries, so some articles risk being missed. EG:

    END-OF-BLOCK: blablablabla...blabla[/text][othertag][ti

    START-OF-NEXT-BLOCK: tle][sometag]blablablablabla...

    So searching for "[title]" in both blocks separately, like TFA does, will fail for one article.

    (I've used square brackets instead of lessthans and greaterthans because slashdot won't let me use them.)
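The boundary problem described above can be sketched as follows (a hypothetical illustration, not the article's actual script): a scanner that carries the unmatched tail of each decompressed chunk into the next one never misses a tag that straddles a boundary:

```python
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)

def titles_across_chunks(chunks):
    """Scan decompressed chunks for <title> tags, carrying the
    unmatched tail forward so that tags split across a chunk
    boundary are still found."""
    tail = ""
    for chunk in chunks:
        data = tail + chunk
        last_end = 0
        for m in TITLE_RE.finditer(data):
            yield m.group(1)
            last_end = m.end()
        # keep everything after the last complete match; a partial
        # "<tit..." at the end of this chunk completes in the next one
        tail = data[last_end:]

# a <title> deliberately split across the chunk boundary:
chunks = ["<page><title>Battle of Ha", "stings</title><text>...</text></page>"]
print(list(titles_across_chunks(chunks)))  # ['Battle of Hastings']
```

In a real scanner the carried tail would need to be capped (e.g. at the maximum plausible tag length) so it doesn't grow without bound between matches.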
    • Oh, and I forgot: article text that ends up spanning block boundaries will appear truncated.
    • by ttsiod ( 881575 ) on Tuesday August 14, 2007 @12:53PM (#20226505) Homepage
      (Re-post: for some reason the response I sent some hours ago didn't appear.) No, actually there is no bug. If you read the contents of the '' script, you'll see that it adapts to a missing '</text>' by reading from the next volume - the next recxxx...bz2.

      As for the title, what you describe can't happen because of a fortunate side-effect: when compressing, bzip2 (like other compressors) looks for previous appearances of a token (in this case, '<title>') and encodes a reference to it (instead of the full text) to save space. Since "text" and "title" appear all the time in these blocks (at least once per article), they will NOT be split - they will be encoded as "references", and therefore what you describe shouldn't happen (I hope :-)

  • I didn't know about the wikipedia raw database, or I'd probably have done something like this myself, and hooked it into the UNIX "locate" db, or Spotlight, or maybe...

    $ man w locate
    GNU Locate
    From Wikipedia, the free encyclopedia
      (redirected from Locate)
    This software-related article is a stub. You can help Wikipedia by expanding it.
  • Anyone with experience of sdict []?

    They offer a dictionary reader for various systems, including portable devices, and dictionaries including Wikipedia.

    Unfortunately their Wikipedia dict is an old one (from January), but it seems like a good approach for laptops or other small devices. When I get an 8GB SDHC card I'm going to try it on my Nokia N800.

  • While interesting, it's certainly reinventing the wheel. There are lots of methods for doing this found on the site itself (download []), including static content already marked up.

    Also, as others have noted, his chopping the file into chunks means you're going to lose at least one article per chunk.

    I'd have implemented this with a compressed file system and maybe some symlinks. Happily, the static content is already there for the taking. Some find and grep o
  • I didn't see any posts on this so I thought I'd bring it up. I think the author took the long way around.

    The author did some nifty hacking that resulted in the following stack of dependencies:

    * Perl 5.8.5
    * Python 2.5
    * PHP 5.2.1
    * Xapian 1.0.2
    * Django 0.9.6

    He cited not wanting to use an RDBMS since he's not writing to the database, just reading. I can give him that,
  • I live in an area with fairly bad mobile signal - I'm always trying to look things up on Wikipedia but finding I can't. Fortunately my Treo 680 smartphone can take 8GB SDHC cards ( []), so I could fit this on with room to spare for MP3s and photos, and future growth of Wikipedia. Very tempting, though I'd need to port it to something like Lua and GCC - obviously the porting would be fairly trivial by the time Palm releases its Linux-based Treos in early 2008...
