Building a Fast Wikipedia Offline Reader
ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop so I could carry it along on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword-based searching (on title words, actually). (3) A search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX-based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built with Xapian. (6) Orders of magnitude faster to install (a matter of hours) than loading the 'dump' into MySQL, which takes days if you want to enable keyword searching."
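For those curious how the Xapian piece might fit together, here is a minimal sketch in Python using the standard Xapian bindings. The title/offset pairs, the database path and the stored payload are assumptions for illustration, not the submitter's actual code:

import xapian

def build_title_index(titles, dbpath="wiki.index"):
    # titles: iterable of (article_title, offset_into_dump) pairs -- hypothetical input
    db = xapian.WritableDatabase(dbpath, xapian.DB_CREATE_OR_OPEN)
    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))
    for title, offset in titles:
        doc = xapian.Document()
        # Store the title plus where the article lives, so a hit can be resolved later.
        doc.set_data("%s\t%d" % (title, offset))
        termgen.set_document(doc)
        termgen.index_text(title)      # only title words are indexed, as in the summary
        db.add_document(doc)
    db.commit()

def search_titles(query_string, dbpath="wiki.index", count=10):
    db = xapian.Database(dbpath)
    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("en"))
    qp.set_database(db)
    enquire = xapian.Enquire(db)
    enquire.set_query(qp.parse_query(query_string))
    # Matches come back ranked, so the user can pick among candidate articles.
    for match in enquire.get_mset(0, count):
        print("%3d%%  %s" % (match.percent, match.document.get_data().decode("utf-8")))

Since only title words go into the index, the on-disk footprint stays small, which is how the summary's "original .bz2 plus the index" claim works out.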
Re:2X (Score:5, Informative)
(1) http://schools-wikipedia.org/
(2) http://download.wikimedia.org/enwiki/latest/
(1) is 4,625 articles hand-picked for school-age children, hence the website name.
(2) is a straight dump of Wikipedia.
Just imagine my surprise when the schools-wikipedia website didn't have the wiki article on Goatse!
Re:Days? Please clarify (Score:3, Informative)
I suspect the former, plus creating the index, plus the not inconsiderable overhead of running an SQL server.
DBs have their place. For a "real" Wiki, or more generally any data collection scenario where you can have a designated server, using a SQL store makes perfect sense.
In most situations, however, the overhead of running a real database on the end user's machine makes no sense (for the record, I consider this one of the biggest non-bug flaws with Vista, though I realize you can technically turn it off, with the resulting loss of functionality). The exact project mentioned in the FP is the perfect example of this: he doesn't want to run a wiki, he just wants to take a dump of it, do text-based searching, and extract pages in some reasonable form. Why would he even consider importing a nice single XML file into a memory-hungry form, from which he would still need a set of frontend tools to extract the desired data and convert it into a convenient viewable form?
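To make that concrete, here is a rough sketch of pulling a single article straight out of the compressed XML dump by streaming, with no database at all. This is not the FP's actual tool; the dump filename and the MediaWiki export namespace string are assumptions, and a real reader would consult an index instead of rescanning the file on every lookup:

import bz2
import xml.etree.ElementTree as ET

# Namespace of the MediaWiki export schema; the exact version varies by dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def extract_article(wanted_title, dump="enwiki-latest-pages-articles.xml.bz2"):
    with bz2.open(dump, "rb") as f:
        # iterparse streams the XML, so memory stays roughly flat instead of
        # materializing the whole dump as one in-memory tree.
        for _, page in ET.iterparse(f):
            if page.tag == NS + "page":
                if page.findtext(NS + "title") == wanted_title:
                    return page.findtext(NS + "revision/" + NS + "text")
                page.clear()    # drop pages we are not interested in
    return None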
Re:There's a bug in TFA: Missing articles. (Score:4, Informative)
As for the title, what you describe can't happen, thanks to a fortunate side effect: when compressing, bzip2 (like other compressors) looks for previous appearances of a token (in this case, '<title>') and codes a reference to it instead of the full text, to save space. Since "text" and "title" appear all the time in these blocks (at least once per article), they will NOT be split; they will be encoded as "references", and therefore what you describe shouldn't happen (I hope :-)
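As an aside, bzip2's block structure is also what makes random access into the .bz2 feasible at all: block starts can be located by scanning for the 48-bit block magic 0x314159265359, much as bzip2recover does. A rough, unoptimized sketch of such a scan (the filename is hypothetical, and occasional false positives are possible since the magic can in principle occur inside compressed data):

BLOCK_MAGIC = 0x314159265359      # bzip2 block header magic, 48 bits

def block_bit_offsets(path):
    # Blocks are not byte-aligned, so slide a 48-bit window over the bit stream.
    offsets = []
    window = 0
    bits_seen = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(1 << 16)
            if not chunk:
                break
            for byte in chunk:
                for shift in (7, 6, 5, 4, 3, 2, 1, 0):
                    window = ((window << 1) | ((byte >> shift) & 1)) & 0xFFFFFFFFFFFF
                    bits_seen += 1
                    if bits_seen >= 48 and window == BLOCK_MAGIC:
                        offsets.append(bits_seen - 48)   # bit offset of the block start
    return offsets

With the block offsets recorded, an offline reader can decompress just the block (or few blocks) containing the article it wants, rather than the whole dump.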