Building a Fast Wikipedia Offline Reader
Posted by
kdawson
on Mon Aug 13, 2007 09:53 PM
from the you-could-look-it-up dept.
ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop to be able to carry it along with me on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword (actually, title words) based searching. (3) Search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built through Xapian. (6) Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL — which, if you want to enable keyword searching, takes days."
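Highlights (2) and (3) can be sketched as follows. This is a minimal illustration only, assuming a plain in-memory inverted index over title words and a crude "fraction of query words matched" score. It is not the submitter's implementation and does not use Xapian; the names `tokenize`, `build_index` and `search` are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase a title and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", title.lower())


def build_index(titles):
    """Map each title word to the set of article ids whose title contains it."""
    index = defaultdict(set)
    for article_id, title in enumerate(titles):
        for word in tokenize(title):
            index[word].add(article_id)
    return index


def search(index, titles, query, limit=5):
    """Return (title, score) pairs, best match first.

    The score is the fraction of distinct query words found in the title,
    a stand-in (assumption) for a real engine's relevance ranking.
    """
    words = set(tokenize(query))
    if not words:
        return []
    hits = Counter()
    for word in words:
        for article_id in index.get(word, ()):
            hits[article_id] += 1
    # Sort by descending score, then alphabetically to keep the order stable.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], titles[kv[0]]))
    return [(titles[i], count / len(words)) for i, count in ranked[:limit]]
```

The caller gets several candidate articles back and picks one, which mirrors the "you choose amongst them" behaviour described in the summary.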
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Wow! (Score:3, Funny)
Re: (Score:3, Insightful)
Re:Why? (Score:5, Funny)
Parent
Just settle it the old way (Score:5, Funny)
Parent
Re: (Score:3, Funny)
Take that, Mr Obviously A. Troll! (Score:5, Funny)
Parent
Re: (Score:2, Funny)
Re:Take that, Mr Obviously A. Troll! (Score:5, Funny)
Parent
Re: (Score:2)
Ironically, you're already reading Slashdot. You have just wasted your time.
You're reading it again, wasting more time ehh??....
But the point is, if programming an offline wikipedia makes you happy and you don't need the money then you would understand....
Re:Why? (Score:5, Insightful)
Complex numbers originated from something "useless" like trying to solve the cubic polynomial in radicals...try building a bridge without them. In fact all of science is built upon people going off on random tangents doing things they enjoy, discovering seemingly "useless facts", but most of it becomes useful *and* gives us an idea of the universe in which we live.
Only working on immediate practical problems is very shortsighted, and if mandated throughout the academic community, would mean the death of innovation and most discoveries.
Parent
SDHC? (Score:5, Informative)
Parent
Re: (Score:3, Informative)
Ho-Hum ... (Score:5, Funny)
Let us know when you're ready for prime time
Re:Ho-Hum ... (Score:5, Insightful)
Parent
Re: (Score:3, Funny)
Not too hard if you have a sub-etha net connection handy. Better check that the article about The Earth which you have been working on hasn't been cut down to two words though.
I hope (Score:4, Funny)
George W Bush
Is a dick head!!!!11
Re:I hope (Score:5, Funny)
Oh, nevermind, I see the problem:
George W Bush
Is a dick head!!!!11
should be
George W Bush
Is a dick head!!!!!!
Man, those out to mess with the content are getting more and more subtle...
Parent
But... (Score:2, Funny)
Hitchhiker's guide here we come! (Score:5, Funny)
Re:Hitchhiker's guide here we come! (Score:5, Funny)
Parent
Re:Hitchhiker's guide here we come! (Score:5, Funny)
Parent
Re:Hitchhiker's guide here we come! (Score:5, Insightful)
Parent
Re:Hitchhiker's guide here we come! (Score:5, Funny)
Parent
Re: (Score:2)
Re:Hitchhiker's guide here we come! (Score:5, Funny)
* 1. It is slightly cheaper
* 2. It has the words "You can copy and edit me for free" inscribed in large friendly letters in the license.
Also like the guide, although it cannot hope to be useful or informative on all matters, it does make the reassuring claim that where it is inaccurate, it is at least definitively inaccurate
Parent
Only 2 days huh (Score:2, Funny)
Good part of the page: the explanation (Score:5, Insightful)
It doesn't take days (Score:5, Informative)
Use the ANSI C implementation, which takes about 20 minutes to convert the XML to SQL; importing the result into MySQL then takes a few hours. Please note that to run a local copy of Wikipedia efficiently you need a properly configured MySQL server with at least 8GB of RAM.
http://meta.wikimedia.org/wiki/Xml2sql [wikimedia.org]
Mass inserts into mysql... (Score:4, Informative)
What?? (Score:5, Funny)
1. Not a thinly-veiled attempt to advertise a crappy product
2. Not bashing Microsoft
3. Not about somebody who is trolling open-source (i.e. SCO)
4. Not about Bush taking away all our rights and ending freedom
5. Not about voting fraud and the end of democracy/America/the world
6. Not decrying Vista DRM and its ties to the MAFIAA
7. Posted on Slashdot
Furthermore, TFA is interesting and informative.
Am I in heaven?
Re:What?? (Score:5, Funny)
Parent
can we get a PSP version of it? (Score:4, Interesting)
better yet, a DS version (Score:4, Informative)
Parent
There's a bug in TFA: Missing articles. (Score:5, Insightful)
The wikipedia database file is one large bzip2'ed XML file which the author splits into blocks of 900k (bzip2's natural blocking) which he then parses for the "title" and "text" XML tags.
The problem with that approach is that some of these tags may well end up being split over block boundaries, so some articles risk being missed. EG:
END-OF-BLOCK: blablablabla...blabla[/text][othertag][ti
START-OF-NEXT-BLOCK: tle][sometag]blablablablabla...
So searching for "[title]" in both blocks separately like TFA does will fail for one article.
(I've used square brackets instead of lessthans and greaterthans because slashdot won't let me use them.)
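One way around the boundary problem is to carry the unfinished tail of each block over into the next one before searching. The sketch below assumes the blocks arrive as already-decompressed byte strings in order, and uses real angle brackets; `titles_across_blocks` is a hypothetical name, and this is not what TFA does.

```python
import re

OPEN_TAG = b"<title>"
TITLE_RE = re.compile(rb"<title>(.*?)</title>", re.DOTALL)


def titles_across_blocks(blocks):
    """Yield every <title>...</title> payload, even if a tag straddles blocks."""
    carry = b""
    for block in blocks:
        data = carry + block
        last_end = 0
        for match in TITLE_RE.finditer(data):
            yield match.group(1)
            last_end = match.end()
        tail = data[last_end:]
        open_pos = tail.find(OPEN_TAG)
        if open_pos != -1:
            # An opening tag with no closing tag yet: keep it all for later.
            carry = tail[open_pos:]
        else:
            # Keep just enough bytes to complete a split opening tag.
            carry = tail[-(len(OPEN_TAG) - 1):]
```

Because only the bytes after the last complete match are carried forward, no title is reported twice, and the carry stays small unless a title is actually in progress.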
Re:There's a bug in TFA: Missing articles. (Score:4, Informative)
As for the title, what you describe can't happen because of a fortunate side-effect: when compressing, bzip2 (like other compressors) looks for previous appearances of a token (in this case, '<title>') and codes a reference to it (instead of the full text) to save space. Since "text" and "title" appear all the time in these blocks (at least once for each article), they will NOT be split - they will be encoded as "references", and therefore, what you describe shouldn't happen (I hope :-)
Parent
Re:Uh.... (Score:5, Interesting)
Parent
Re: (Score:2, Insightful)
Have you ever worked on a project called "Clippey", by chance?
Re:Uh.... (Score:5, Funny)
Have you ever worked on a project called "Clippey", by chance?
Parent
Re:Uh.... (Score:4, Informative)
Parent
I know the feeling (Score:5, Insightful)
Parent
Re: (Score:3, Interesting)
It's funny how time changes you.
Re: (Score:2)
(Or to always have something to read on your laptop while traveling - this is what I would use it for)
Re: (Score:3, Funny)
(Or to always have something to read on your laptop while traveling - this is what I would use it for)
Sorry, I couldn't resist!
Re:Just hope you don't get an effed image. (Score:5, Insightful)
Parent
Re:Just hope you don't get an effed image. (Score:4, Funny)
Parent
Re: (Score:3, Funny)
Re:2X (Score:5, Informative)
Parent
Re:2X (Score:5, Informative)
(1) http://schools-wikipedia.org/ [schools-wikipedia.org]
(2) http://download.wikimedia.org/enwiki/latest/ [wikimedia.org]
1 is 4625 articles hand-picked for school-age children, hence the website name
2 is a straight dump of wikipedia
Just imagine my surprise when the schools-wikipedia website didn't have the wiki article on Goatse!
Parent
Re:The Point? (Score:5, Insightful)
Joe has all-you-can-eat broadband at home, or an understanding employer with a fat pipe, and spends two hours each day on the train. Two and a half gig per month (and let's face it, you probably don't want to update it more frequently than that) and he's got probably half his reading material sorted out.
Wang lives in Buttfuckistan, a fictional country with totalitarian leanings with too many real-world counterparts. The Great Firewall of Buttfuckistan (i.e. squidguard, under the control of Buttfuckistan Telecom, and settings in the routers to drop non-port-80 traffic half the time) makes it impossible to reliably access Wikipedia from inside their borders, which is a great shame because the entry on Buttfuckistan is particularly unflattering. Once a month, Joe sticks a DVD with five minutes from an old re-run of Friends and an encrypted dump of Wikipedia in an airmail envelope and sends it to Wang.
Mary is still at secondary school, and her particular school has wifi access for students who are encouraged to purchase their own laptops, but since the local pastor discovered http://en.wikipedia.org/wiki/Image:Dream_of_the_f
Still can't see the point?
Parent
Re: (Score:3, Informative)
I suspect the former, plus creating the index, plus the not inconsiderable overhead of running an SQL server.
DBs have their place. For a "real" Wiki, or more generally