Programming

Building a Fast Wikipedia Offline Reader (208 comments)

ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop to be able to carry it along with me on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword (actually, title words) based searching. (3) Search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built through Xapian. (6) Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL — which, if you want to enable keyword searching, takes days."
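As a rough illustration of highlights (2) and (3), here is a toy title-word index in plain Python. It is not the submitter's code and it does not use Xapian: the function names are hypothetical, and the score is simply the fraction of query words found in a title, a crude stand-in for a real probabilistic ranking.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_title_index(titles):
    """Map each title word to the set of titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for word in tokenize(title):
            index[word].add(title)
    return index


def search(index, query):
    """Return (title, score) pairs, best match first.

    The score is the fraction of distinct query words that appear in
    the title. This is an assumed toy ranking, not Xapian's.
    """
    words = set(tokenize(query))
    if not words:
        return []
    hits = defaultdict(int)
    for word in words:
        for title in index.get(word, ()):
            hits[title] += 1
    # Sort by descending hit count, then alphabetically for stable output.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(title, count / len(words)) for title, count in ranked]


if __name__ == "__main__":
    idx = build_title_index(["Prime number", "Complex number", "Prime Minister"])
    for title, score in search(idx, "prime number"):
        print(f"{score:.2f}  {title}")
```

The user still picks among the candidates, as in the summary; the index only narrows the list and orders it.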
This discussion has been archived. No new comments can be posted.

  • Re:Why? (Score:3, Insightful)

    by bn557 ( 183935 ) on Monday August 13, 2007 @11:16PM (#20220685) Homepage Journal
    This may seem like a stupid, trivial, and pointless project, but the programmer may have gained something from it that he could use later in something you don't feel that way about. If the programmer enjoyed doing it, that might have led to a more productive coding session later in the day too.
  • Re:Uh.... (Score:2, Insightful)

    by Tablizer ( 95088 ) on Monday August 13, 2007 @11:54PM (#20220947) Journal
    I love programming useless things just for the challenge.

    Have you ever worked on a project called "Clippey", by chance?
         
  • by Anonymous Coward on Tuesday August 14, 2007 @12:06AM (#20221035)
    And that doesn't happen offline? Only naive people like you need to be worried about reading Wikipedia.

    There are bastards of every academic, social, and financial background.
  • by Tacvek ( 948259 ) on Tuesday August 14, 2007 @12:14AM (#20221083) Journal
    My very serious question to you is how much better do you think things are at a "real" encyclopedia? They have many of the same problems, but they are just not public. "Real" encyclopedias can be just as inaccurate as Wikipedia on many articles. For a quick first reference, Wikipedia is an ideal tool. Just be sure to take things with a grain of salt if you are not checking the sources for further information. Guess what though, the same applies to "real" encyclopedias too. One difference is that with "real" encyclopedias, you always lack revision information, and you often lack information about the sources used by the editors. (Some encyclopedias are better than others in that respect.)
  • by phliar ( 87116 ) on Tuesday August 14, 2007 @12:27AM (#20221185) Homepage
    For a change it's not just a link to a .tar.gz somewhere, but an actual article where he goes through what he did, and (more important) why he did things that way. Good reading even if you don't want an off-line Wikipedia.
  • Re:Why? (Score:5, Insightful)

    by thePsychologist ( 1062886 ) on Tuesday August 14, 2007 @01:05AM (#20221433) Journal
    Realize that some of the greatest things done by humankind were from doing "pointless projects" as you call them. Prime numbers for instance were studied by mathematicians just for fun, and now look, they're used for cryptography. Try doing your banking without them.

    Complex numbers originated from something "useless" like trying to solve the cubic polynomial in radicals... try building a bridge without them. In fact, all of science is built upon people going off on random tangents, doing things they enjoy and discovering seemingly "useless facts", but most of it becomes useful *and* gives us an idea of the universe in which we live.

    Only working on immediate practical problems is very shortsighted, and if mandated throughout the academic community, would mean the death of innovation and most discoveries.
  • Re:Ho-Hum ... (Score:5, Insightful)

    by OzRoy ( 602691 ) on Tuesday August 14, 2007 @01:18AM (#20221501)
    Auto-update would be interesting. How do you keep the data up to date without downloading the entire 2.9G again? Is there some sort of diff file you can download?
  • I know the feeling (Score:5, Insightful)

    by aepervius ( 535155 ) on Tuesday August 14, 2007 @01:30AM (#20221543)
    They tell you that their hobby is painting/music/walking/repairing old cars/gardening/building scale models etc... And they seem to think that their hobbies are perfectly acceptable. But as soon as you say you like to program stuff, they don't understand how this could be a hobby. They mostly fail to recognize that every one of us has something in common: the joy of the act of creation. The fact that our hobby entails creating something immaterial and full of "logic" does not matter. It is still a joy.
  • by cowens ( 30752 ) on Tuesday August 14, 2007 @03:06AM (#20221949)
    Ah, but that is what the original HHGTTG was as well. Tons of info on alcohol and Eccentrica Gallumbits (the triple-breasted whore of Eroticon Six), but the entry for Earth was: Harmless. Later it was expanded: Mostly Harmless.
  • Re:The Point? (Score:5, Insightful)

    by Mr. Roadkill ( 731328 ) on Tuesday August 14, 2007 @03:26AM (#20222067)

    I know that not everyone has a permanent connection to the net everywhere they go, but what is the point of storing a local copy of Wikipedia?
    Ummm... I think the whole point is, as you've pointed out, that not everyone has a permanent connection to the net everywhere they go. Or maybe they don't have access to everything they'd like even if they *do* have net access everywhere, or want to pay extravagant data rates while out and about.

    Joe has all-you-can-eat broadband at home, or an understanding employer with a fat pipe, and spends two hours each day on the train. Two and a half gig per month (and let's face it, you probably don't want to update it more frequently than that) and he's got probably half his reading material sorted out.

    Wang lives in Buttfuckistan, a fictional country with totalitarian leanings with too many real-world counterparts. The Great Firewall of Buttfuckistan (i.e. squidguard, under the control of Buttfuckistan Telecom, and settings in the routers to drop non-port-80 traffic half the time) makes it impossible to reliably access Wikipedia from inside their borders, which is a great shame because the entry on Buttfuckistan is particularly unflattering. Once a month, Joe sticks a DVD with five minutes from an old re-run of Friends and an encrypted dump of Wikipedia in an airmail envelope and sends it to Wang.

    Mary is still at secondary school, and her particular school has wifi access for students who are encouraged to purchase their own laptops, but since the local pastor discovered http://en.wikipedia.org/wiki/Image:Dream_of_the_fishermans_wife_hokusai.jpg [wikipedia.org] they've been forced to add wikipedia to the school's blocklist. Which is a pity, because it's a great first-approximation source for material or research directions, but there you go. Mary can make a local copy through her home broadband connection, and can access it locally on her laptop wherever she goes - even at school, or church. Bill, Jillian and Mungo (the pastor's son) find out about this, and now all four of them take it in turns to make the copy each month, sharing the bandwidth costs. Their friends Harry and Sally, who don't have broadband but are great friends of the other four, also get copies... and there are plans to distribute the copies further, as a kind of teenage grass-roots knowledge-sharing and social-justice effort.

    Still can't see the point?
  • by dannycim ( 442761 ) on Tuesday August 14, 2007 @05:25AM (#20222493)
    There's a serious problem with the article's way of treating the data that I didn't see addressed.

    The wikipedia database file is one large bzip2'ed XML file which the author splits into blocks of 900k (bzip2's natural blocking) which he then parses for the "title" and "text" XML tags.

    The problem with that approach is that some of these tags may well end up being split over block boundaries, so some articles risk being missed. E.g.:

    END-OF-BLOCK: blablablabla...blabla[/text][othertag][ti

    START-OF-NEXT-BLOCK: tle][sometag]blablablablabla...

    So searching for "[title]" in both blocks separately like TFA does will fail for one article.

    (I've used square brackets instead of lessthans and greaterthans because slashdot won't let me use them.)
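The boundary problem described in this comment can be reproduced in a few lines. The sketch below is illustrative only: it is not the article's code, the function names are hypothetical, it uses the real angle-bracket tags rather than the square brackets above, and it treats each decompressed block as a plain string, whereas a real reader would be handling bytes coming out of bzip2. It compares a per-block search with one that carries the unparsed tail of each block into the next.

```python
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>")
OPEN_TAG = "<title>"


def titles_per_block(blocks):
    """Naive approach: search every decompressed block on its own."""
    found = []
    for block in blocks:
        found.extend(TITLE_RE.findall(block))
    return found


def titles_with_carry(blocks):
    """Prepend the unfinished tail of the previous block before searching."""
    found = []
    carry = ""
    for block in blocks:
        data = carry + block
        last_end = 0
        for match in TITLE_RE.finditer(data):
            found.append(match.group(1))
            last_end = match.end()
        tail = data[last_end:]
        start = tail.rfind(OPEN_TAG)
        if start != -1:
            # An opened but unclosed title: keep it for the next block.
            carry = tail[start:]
        else:
            # Keep just enough characters to complete a split "<title>" tag.
            carry = tail[-(len(OPEN_TAG) - 1):]
    return found


if __name__ == "__main__":
    blocks = [
        "<page><title>Alpha</title><text>aaa</text></page><page><ti",
        "tle>Beta</title><text>bbb</text></page>",
    ]
    print(titles_per_block(blocks))   # the split title is missed
    print(titles_with_carry(blocks))  # both titles are found
```

The same carry-over idea applies to the "text" tag. Whether the article's reader is actually affected depends on how it maps titles to blocks, which this sketch does not attempt to model.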