Building a Fast Wikipedia Offline Reader

Posted by kdawson on Monday August 13, 2007 @10:53PM from the you-could-look-it-up dept.

ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop to be able to carry it along with me on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword (actually, title words) based searching. (3) Search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built through Xapian. (6) Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL — which, if you want to enable keyword searching, takes days."
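The title-word search with ranked candidates described in the summary can be sketched roughly as follows. This is a plain-Python stand-in, not the submitter's code and not the Xapian API: the function names (`tokenize`, `build_index`, `search`) and the scoring rule (fraction of query words found in the title) are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(title):
    """Lowercase a title and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", title.lower())


def build_index(titles):
    """Map each title word to the set of title ids that contain it."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for word in tokenize(title):
            index[word].add(doc_id)
    return index


def search(index, titles, query, limit=10):
    """Return (title, score) pairs, best match first.

    The score is the fraction of distinct query words present in the
    title (a hypothetical ranking rule, chosen only for this sketch).
    """
    words = set(tokenize(query))
    if not words:
        return []
    hits = defaultdict(int)
    for word in words:
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    # Sort by match count (descending), then alphabetically by title.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], titles[kv[0]]))
    return [(titles[doc_id], count / len(words)) for doc_id, count in ranked[:limit]]


if __name__ == "__main__":
    titles = ["Albert Einstein", "Einstein field equations", "Field (mathematics)"]
    index = build_index(titles)
    for title, score in search(index, titles, "einstein field"):
        print(f"{score:.2f}  {title}")
```

Only the titles are indexed, which keeps the index small next to the compressed dump; the reader then picks the article they want from the ranked list.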

Re:2X (Score:5, Informative)

by Brian Gordon ( 987471 ) writes: on Tuesday August 14, 2007 @12:24AM (#20221161)

Ahaha, 2.9GB? That's the text alone. Images will net you more than 200GB [wikimedia.org] more. And yes, you do need a LAMP/WAMP stack and a working MediaWiki, but it wouldn't take 'days'; it would take a few hours max. Also is this guy aware that wikipedia is available on DVD [wikipedia.org] already?

Re:Uh.... (Score:4, Informative)

by stephanruby ( 542433 ) writes: on Tuesday August 14, 2007 @12:38AM (#20221265)

Programming something new to some people is like playing a video game.
Speaking of which, http://www.pyweek.org/ [pyweek.org] is coming up in the first week of September. It's time to dust off that Python book (or borrow one from someone) and do whatever you have to do to get some days off that week.

Re:2X (Score:5, Informative)

by TubeSteak ( 669689 ) writes: on Tuesday August 14, 2007 @01:07AM (#20221457) Journal

"Also is this guy aware that wikipedia is available on DVD already?"
Are you aware that the link you pointed to (1) is not the same thing as the link (2) the author pointed to?
(1) http://schools-wikipedia.org/ [schools-wikipedia.org]
(2) http://download.wikimedia.org/enwiki/latest/ [wikimedia.org]

(1) is 4,625 articles hand-picked for school-age children, hence the website name.
(2) is a straight dump of Wikipedia.

Just imagine my surprise when the schools-wikipedia website didn't have the wiki article on Goatse!

Mass inserts into mysql... (Score:4, Informative)

by Splab ( 574204 ) writes: on Tuesday August 14, 2007 @01:32AM (#20221557)

is very, very slow when you do it on a normal installation. The reason is that MySQL ships with a "be nice to people who don't know what they are doing" setup. Go into my.cnf, find the buffer settings, crank them up, and restart the server. That can make a big difference (especially if you are running InnoDB, which you of course are, since MyISAM isn't a proper database).
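As a rough illustration of the kind of settings meant here, the fragment below names a few real MySQL buffer variables. The values are placeholder assumptions, not recommendations, and would need to be sized to the machine's RAM and the MySQL version in use.

```ini
[mysqld]
# Illustrative values only (assumptions); tune to available memory.
innodb_buffer_pool_size = 1G     # main InnoDB cache for data and indexes
innodb_log_buffer_size  = 16M    # buffer for redo log writes
key_buffer_size         = 256M   # MyISAM index cache, if MyISAM tables are used
bulk_insert_buffer_size = 64M    # MyISAM bulk-insert cache
```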

SDHC? (Score:5, Informative)

by tepples ( 727027 ) writes: <tepplesNO@SPAMgmail.com> on Tuesday August 14, 2007 @07:03AM (#20222863) Homepage Journal

"Why? Have you seen the price of 4G flash cards recently?"
Yes, and it's possible that you may need a new PDA in order to use SD cards larger than 2 GB. The 4 GB ones use a different protocol called SDHC [pocketpccentral.net] that older PDAs may not support. It's analogous to the old ATA hard disk size barriers [pcguide.com], especially the 137 GB (128 GiB) barrier [pcguide.com]. Or are most PDAs capable of being upgraded to handle SDHC?
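For reference, the 137 GB (128 GiB) figure follows from 28-bit LBA addressing with 512-byte sectors. A quick check of the arithmetic (the function name is just for illustration):

```python
SECTOR_BYTES = 512   # traditional ATA sector size
LBA_BITS = 28        # width of the original ATA logical block address


def lba_limit_bytes(lba_bits=LBA_BITS, sector_bytes=SECTOR_BYTES):
    """Largest capacity addressable with the given LBA width."""
    return (2 ** lba_bits) * sector_bytes


if __name__ == "__main__":
    limit = lba_limit_bytes()
    print(limit, "bytes")
    print(limit / 10**9, "GB (decimal)")
    print(limit / 2**30, "GiB (binary)")
```

The same capacity reads as about 137 GB in decimal units and exactly 128 GiB in binary units, which is why both numbers appear.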

better yet, a DS version (Score:4, Informative)

by tepples ( 727027 ) writes: <tepplesNO@SPAMgmail.com> on Tuesday August 14, 2007 @07:23AM (#20222965) Homepage Journal

"A PSP is very portable (fits in your sweater/backpack), hackable"
You have to buy a used PSP to be sure that you can hack it. New ones are likely to come with firmware version 3.51 or later, which is not cracked as of August 2007. The Nintendo DS, on the other hand, had its last major firmware update in September 2005 and is still cracked, with SLOT-1 modchips available at Wal-Mart for $30.
"and has up to 8Gb of storage"
So does a CompactFlash card in a GBA Movie Player in a Nintendo DS. It's a pity that the SLOT-1 adapters for DS haven't been shown to be compatible with SDHC.

Re:Days? Please clarify (Score:3, Informative)

by pla ( 258480 ) writes: on Tuesday August 14, 2007 @07:48AM (#20223093) Journal

"Do you mean searching takes days, or loading? Searching should be quick if you index the words. If you are duplicating a bunch of local clones of wiki, then simply copy down the raw MySql table data files rather than reload from delimited files etc. (One needs to make sure their version of MySql is compatible with the table file format.)"

I suspect the former, plus creating the index, plus the not inconsiderable overhead of running an SQL server.

DBs have their place. For a "real" Wiki, or more generally any data collection scenario where you can have a designated server, using a SQL store makes perfect sense.

In most situations, however, the overhead of running a real database on the end-user's machine makes no sense (for the record, I consider this one of the biggest non-bug flaws with Vista, though I realize you can technically turn it off, with the resulting loss of functionality). The exact project mentioned in the FP is a perfect example of this: he doesn't want to run a Wiki, he just wants to take a dump of it, do text-based searching, and extract pages in some reasonable form. Why would he even consider importing a nice single XML file into a memory-hungry form, from which he would still need a set of frontend tools to extract the desired data and convert it to a conveniently viewable form?

Re:SDHC? (Score:3, Informative)

by SCHecklerX ( 229973 ) writes: <greg@gksnetworks.com> on Tuesday August 14, 2007 @11:04AM (#20224997) Homepage

You can get a normal 4GB SD card from Transcend. I am using one in my SanDisk Sansa e140, which is definitely NOT SDHC compatible.

Re:There's a bug in TFA: Missing articles. (Score:4, Informative)

by ttsiod ( 881575 ) writes: on Tuesday August 14, 2007 @12:53PM (#20226505) Homepage

(Re-post: for some reason the response I sent some hours ago didn't appear) No, actually there is no bug. If you read the contents of the 'show.pl' script, you'll see that it adapts to a missing '</text>' by reading from the next volume - the next recxxx...bz2.
As for the title, what you describe can't happen because of a fortunate side-effect: when compressing, bzip2 (like other compressors) looks for previous appearances of a token (in this case, '<title>') and codes a reference to it (instead of the full text) to save space. Since "text" and "title" appear all the time in these blocks (at least once for each article), they will NOT be split - they will be encoded as "references", and therefore, what you describe shouldn't happen (I hope :-)
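The fallback described above (carry on into the next volume when '</text>' is missing) can be sketched in Python along these lines. This is not the 'show.pl' script itself: the function name `extract_text` is hypothetical, and the volumes are assumed to be held in memory as independently compressed bz2 streams to keep the example short. On disk, each entry would be the contents of one volume file.

```python
import bz2


def extract_text(volumes, first, title):
    """Return the <text> body of the article `title`, or None if not found.

    volumes: list of bytes, each an independently decompressible bz2 stream.
    first:   index of the volume in which the article's <title> appears.
    If the closing </text> is not in that volume, the following volumes
    are appended one by one until it turns up or the volumes run out.
    """
    data = bz2.decompress(volumes[first])
    marker = b"<title>" + title.encode("utf-8") + b"</title>"
    start = data.find(marker)
    if start < 0:
        return None
    data = data[start:]

    nxt = first + 1
    while b"</text>" not in data and nxt < len(volumes):
        # Article continues past the end of this volume: read the next one.
        data += bz2.decompress(volumes[nxt])
        nxt += 1

    open_tag = data.find(b"<text")
    if open_tag < 0:
        return None
    body_start = data.find(b">", open_tag) + 1
    body_end = data.find(b"</text>", body_start)
    if body_end < 0:
        return None
    # Decode only after joining, so a multi-byte character split across
    # two volumes is reassembled before decoding.
    return data[body_start:body_end].decode("utf-8", errors="replace")
```

Working on raw bytes and decoding at the end is a deliberate choice here: decoding each volume separately could garble a UTF-8 character that straddles a volume boundary.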
