Building a Fast Wikipedia Offline Reader

Posted by kdawson on Monday August 13, 2007 @10:53PM from the you-could-look-it-up dept.

ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop to be able to carry it along with me on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword (actually, title words) based searching. (3) Search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built through Xapian. (6) Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL — which, if you want to enable keyword searching, takes days."
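
To make highlights (2) and (3) concrete, here is a minimal sketch of title-word search with ranked results. It is not the submitter's code: the article uses Xapian for indexing, while this stand-in uses a plain in-memory inverted index, and the names `build_title_index` and `search` as well as the sample titles are made up for illustration. The "probability" here is simply the fraction of query words found in a title.

```python
from collections import defaultdict


def build_title_index(titles):
    """Map each lowercase title word to the set of titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.lower().split():
            index[word].add(title)
    return index


def search(index, query):
    """Return (title, score) pairs, best match first.

    The score is the fraction of distinct query words present in the title,
    a crude stand-in for the relevance ranking a real engine would compute.
    """
    words = set(query.lower().split())
    if not words:
        return []
    hits = defaultdict(int)
    for word in words:
        for title in index.get(word, ()):
            hits[title] += 1
    # Sort by descending hit count, then alphabetically for stable output.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(title, count / len(words)) for title, count in ranked]


# Hypothetical sample data, not taken from the dump.
titles = [
    "Coffee",
    "History of coffee",
    "Coffee production in Brazil",
    "History of tea",
]
index = build_title_index(titles)
results = search(index, "history coffee")
for title, score in results:
    print(f"{score:.2f}  {title}")
```

The user would then pick one of the listed candidates, which matches the "you choose amongst them" behaviour described in the summary.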

This discussion has been archived. No new comments can be posted.

Wow! (Score:3, Funny)

by ferrocene ( 203243 ) writes: on Monday August 13, 2007 @10:56PM (#20220565) Journal

After doing all that, I think you may have missed your flight! :)

- - Re: (Score:3, Insightful)
    
    by bn557 ( 183935 ) writes:
    
    This may seem like a stupid, trivial, and pointless project, but the programmer may have gained something from it that he could use later in something you don't feel that way about. If the programmer enjoyed doing it, that might have led to a more productive coding session later in the day too.
  - Re:Why? (Score:5, Funny)
    
    by rabblerabble ( 884373 ) writes: on Monday August 13, 2007 @11:20PM (#20220707)
    
    I'll bite... Unfortunately, I don't have a basement, so there are times that I am required to venture into the outer realm that happens to be heated by the big ball of gas known as Sol, as opposed to a pump ;P Seriously though, this is exactly what I have been looking for. What better way to show up your friends when they cry "You're wrong, google it!" knowing that there is no connection possible within twenty miles. Next time I'm drunk at the beach and someone wants to pretend to know the history of coffee harvesting, it's on.
    
    - Just settle it the old way (Score:5, Funny)
      
      by EmbeddedJanitor ( 597831 ) writes: on Monday August 13, 2007 @11:22PM (#20220725)
      
      Kick sand in their face!
      
      - Re: (Score:3, Funny)
        
        by rabblerabble ( 884373 ) writes:
        
        The goggles would work then. Your logic is flawed.
    - - Re: (Score:2)
        
        by 644bd346996 ( 1012333 ) writes:
        
        Actually, the only heated intellectual debates I regularly get into are on camping trips, where cellular reception is very rare. Worse yet, my friends often expect me to have encyclopedic knowledge with which to settle such debates.
        
        Conversations around a campfire can go anywhere.
  - Take that, Mr Obviously A. Troll! (Score:5, Funny)
    
    by ampathee ( 682788 ) writes: on Monday August 13, 2007 @11:22PM (#20220719)
    
    Programmers shouldn't be wasting time on these trivial, pointless projects. We need their work in other more important projects!
    Hah! I'm going to start work on (let's see..) a random lolcat generator now, just to piss you off.
    
    - Re: (Score:2, Funny)
      
      by SoapDish ( 971052 ) writes:
      
      Make sure to write it in LOL-CODE! (http://lolcode.com/ [lolcode.com])
    - Re:Take that, Mr Obviously A. Troll! (Score:5, Funny)
      
      by MarkRose ( 820682 ) writes: on Tuesday August 14, 2007 @12:52AM (#20221371) Homepage
      
      You mean something like lolcatgenerator.com [lolcatgenerator.com]? Looks like someone already tackled that important project! lol
      
    - Re: (Score:2)
      
      by SpooForBrains ( 771537 ) writes:
      
      This makes me want to tackle that Global Toilet Database project I've been planning since I was 12 (at which point it was going to be a book)
    - Re: (Score:2)
      
      by imsabbel ( 611519 ) writes:
      
      If you REALLY want to piss him off, try writing the generator in LOLCODE (http://lolcode.com/)
  - Re: (Score:2)
    
    by RuBLed ( 995686 ) writes:
    
    Programmers shouldn't be wasting time on these trivial, pointless projects. We need their work in other more important projects!
    Ironically, you're already reading Slashdot. You have just wasted your time.
    
    You're reading it again, wasting more time ehh??....
    
    But the point is, if programming an offline wikipedia makes you happy and you don't need the money then you would understand....
  - Re:Why? (Score:5, Insightful)
    
    by thePsychologist ( 1062886 ) writes: on Tuesday August 14, 2007 @01:05AM (#20221433) Journal
    
    Realize that some of the greatest things done by humankind were from doing "pointless projects" as you call them. Prime numbers for instance were studied by mathematicians just for fun, and now look, they're used for cryptography. Try doing your banking without them.
    
    Complex numbers originated from something "useless" like trying to solve the cubic polynomial in radicals...try building a bridge without them. In fact all of science is built upon people going off on random tangents doing things they enjoy, discovering seemingly "useless facts" but most of it becomes useful *and* gives us an idea of the universe in which we live.
    
    Only working on immediate practical problems is very shortsighted, and if mandated throughout the academic community, would mean the death of innovation and most discoveries.
    
  - Re: (Score:2)
    
    by Tweekster ( 949766 ) writes:
    
    Because I can
    
    which is one of the greatest motivations for human advancements.
- - Re: (Score:2)
    
    by CoolVibe ( 11466 ) writes:
    
    Be my guest... You do need that 2.9 GB file somewhere on your PDA first. That just *might* be an issue.
    - Re: (Score:2)
      
      by Eunuchswear ( 210685 ) writes:
      
      Be my guest... You do need that 2.9 GB file somewhere on your PDA first. That just *might* be an issue.
      
      Why? Have you seen the price of 4G flash cards recently?
      - Re: (Score:2)
        
        by CoolVibe ( 11466 ) writes:
        
        Yes, in the USA... I live in the Netherlands where 2GB flash is just getting cheap.
      - SDHC? (Score:5, Informative)
        
        by tepples ( 727027 ) writes: <tepples.gmail@com> on Tuesday August 14, 2007 @07:03AM (#20222863) Homepage Journal
        
        Why? Have you seen the price of 4G flash cards recently?
        Yes, and it's possible that you may need a new PDA in order to use SD cards larger than 2 GB. The 4 GB ones use a different protocol called SDHC [pocketpccentral.net] that older PDAs may not support. It's analogous to the old ATA hard disk size barriers [pcguide.com], especially the 137 GB (128 GiB) barrier [pcguide.com]. Or are most PDAs capable of being upgraded to handle SDHC?
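
The "137 GB (128 GiB)" figure for the ATA barrier can be checked with a little arithmetic. A minimal sketch, assuming the usual explanation that the barrier comes from 28-bit sector addresses with 512-byte sectors; the function name `lba_limit_bytes` is made up for illustration:

```python
SECTOR_BYTES = 512  # traditional ATA sector size (assumption stated above)


def lba_limit_bytes(address_bits, sector_bytes=SECTOR_BYTES):
    """Largest addressable capacity for a given sector-address width."""
    return (2 ** address_bits) * sector_bytes


limit = lba_limit_bytes(28)
print(limit)             # 137438953472 bytes
print(limit / 10 ** 9)   # about 137.4 decimal gigabytes
print(limit / 2 ** 30)   # 128.0 binary gibibytes
```

The same capacity is therefore "137 GB" in decimal units and "128 GiB" in binary units, which is why both numbers appear in the comment.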
        
        
        Re: (Score:3, Informative)
        
        by SCHecklerX ( 229973 ) writes:
        
        You can get a normal 4GB SD card from Transcend. I am using one in my sandisk sansa e140, which is definitely NOT SDHC compatible.
Ho-Hum ... (Score:5, Funny)

by jabberwock ( 10206 ) writes: on Monday August 13, 2007 @11:21PM (#20220715) Homepage

What, no auto update? No User Agreement? No disabled features that are enabled by a mammoth key? No product registration?

Let us know when you're ready for prime time ... ;-)

- Re:Ho-Hum ... (Score:5, Insightful)
  
  by OzRoy ( 602691 ) writes: on Tuesday August 14, 2007 @01:18AM (#20221501)
  
  Auto-update would be interesting. How do you keep the data up to date without downloading the entire 2.9G again? Is there some sort of diff file you can download?
  
  - Re: (Score:3, Funny)
    
    by MichaelSmith ( 789609 ) writes:
    
    How do you keep the data up to date without downloading the entire 2.9G again?
    
    Not too hard if you have a sub-etha net connection handy. Better check that the article about The Earth which you have been working on hasn't been cut down to two words though.
  - - Re: (Score:2)
      
      by superpulpsicle ( 533373 ) writes:
      
      It said the entire wikipedia package is Size : 4.37GB. All those articles and images only come out to this size?
I hope (Score:4, Funny)

by Nikron ( 888774 ) writes: on Monday August 13, 2007 @11:36PM (#20220817)

That you don't dump the wiki at a bad time.
George W Bush
Is a dick head!!!!11

- Re:I hope (Score:5, Funny)
  
  by Anonymous Coward writes: on Monday August 13, 2007 @11:58PM (#20220979)
  
  You mean before someone makes it inaccurate again?
  
  Oh, nevermind, I see the problem:
  
  George W Bush
  
  Is a dick head!!!!11
  
  should be
  
  George W Bush
  
  Is a dick head!!!!!!
  
  Man, those out to mess with the content are getting more and more subtle...
  
- Re: (Score:2)
  
  by Ours ( 596171 ) writes:
  
  Looks fine by me.
But... (Score:2, Funny)

by Anonymous Coward writes:

What's the point of it if there are no vandals or flame wars to make it interesting?
Hitchhiker's guide here we come! (Score:5, Funny)

by Brietech ( 668850 ) writes: on Monday August 13, 2007 @11:45PM (#20220879)

Combine this and one of the new E-ink ebook readers, make it pretty rugged, slap a solar panel on the back and man. . . you have something really close to a genuine hitchhiker's guide to the galaxy. Ah, I love where technology is heading =)

- Re:Hitchhiker's guide here we come! (Score:5, Funny)
  
  by Sneftel ( 15416 ) writes: on Monday August 13, 2007 @11:55PM (#20220949)
  
  As long as hitchhikers primarily need to know how to evolve a Pikachu into a Raichu, and how Benjamin Disraeli has been referenced in pop culture.
  
  - Re:Hitchhiker's guide here we come! (Score:5, Funny)
    
    by RandomWhiteMan ( 685768 ) writes: on Tuesday August 14, 2007 @12:08AM (#20221051)
    
    You laugh now, but just wait until you're stranded in the middle of Blackheath England, needing a ride from a conservative British History Scholar who has his son with him playing Pokemon Gold. Won't be so smug then, will you. I bet you won't even have your towel on you when this all goes down.
    
    - Re: (Score:2)
      
      by Aardpig ( 622459 ) writes:
      
      If you're stranded in the middle of Blackheath, I suggest a five minute walk over to the Hare and Billet, for a pint and some peanuts, would be best.
  - Re:Hitchhiker's guide here we come! (Score:5, Insightful)
    
    by cowens ( 30752 ) writes: on Tuesday August 14, 2007 @03:06AM (#20221949)
    
    Ah, but that is what the original HHGTTG was as well. Tons of info on alcohol and Eccentricea Gallumbits (the triple breasted whore of Eroticon Six), but the entry for Earth was: Harmless. Later it was expanded: Mostly Harmless.
    
  - Re:Hitchhiker's guide here we come! (Score:5, Funny)
    
    by nstlgc ( 945418 ) writes: on Tuesday August 14, 2007 @03:31AM (#20222091)
    
    Just so we're clear, you can make Pikachu evolve into Raichu by using the Thunder Stone (which makes sense, since they're Electric Pokémon). However, due to the emotional value Pikachu has to trainers, most of them choose not to evolve him. Some Pokémon games even plain don't allow this. I hope this was helpful.
    
    - Re: (Score:2)
      
      by creepynut ( 933825 ) writes:
      
      Re:Hitchhiker's guide here we come! (Score:3, Informative)
      
      This is why I love Slashdot. Even Pokemon trainers will take time out of their days to moderate Slashdot, proving the system works.
  - Re: (Score:2)
    
    by Frozen Void ( 831218 ) writes:
    
    I have a nagging suspicion that people like you have finally made wikipedia to merge all pokemons in batch articles.
- Re: (Score:2)
  
  by dch24 ( 904899 ) writes:
  
  Don't forget to put this on the cover, in large reassuring letters:
  DON'T PANIC
- Re:Hitchhiker's guide here we come! (Score:5, Funny)
  
  by Gromius ( 677157 ) writes: on Tuesday August 14, 2007 @02:53AM (#20221905)
  
  Yes, it's a perfect fit. Particularly as Wikipedia has now supplanted the Encyclopedia Britannica in many places as the standard repository of all knowledge and wisdom. Although it has many omissions, contains much that is apocryphal, or at least wildly inaccurate, it scores over the older more pedestrian work in two important ways.
  
  * 1. It is slightly cheaper
  * 2. It has the words "You can copy and edit me for free" inscribed in large friendly letters in the license.
  
  Also like the guide, although it cannot hope to be useful or informative on all matters, it does make the reassuring claim that where it is inaccurate, it is at least definitively inaccurate :)
  
  - Re: (Score:2)
    
    by MichaelSmith ( 789609 ) writes:
    
    You deserved a funny mod. Thanks for that. I feel happy and sad at the same time.
Only 2 days huh (Score:2, Funny)

by Anonymous Coward writes:

I was able to build this in two days, most of which were spent searching for the appropriate tools. Simply unbelievable... toying around with these tools and writing less than 200 lines of code, and... presto!
Give that man a job at Google.
- Re: (Score:2, Funny)
  
  by dmdavis ( 949140 ) writes:
  
  Sorry, but he never states that his product is in beta.
Good part of the page: the explanation (Score:5, Insightful)

by phliar ( 87116 ) writes: on Tuesday August 14, 2007 @12:27AM (#20221185) Homepage

For a change it's not just a link to a .tar.gz somewhere, but an actual article where he goes through what he did, and (more important) why he did things that way. Good reading even if you don't want an off-line Wikipedia.

Comment removed (Score:5, Informative)

by account_deleted ( 4530225 ) writes: on Tuesday August 14, 2007 @01:10AM (#20221467)

Comment removed based on user account deletion

- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re: (Score:2)
    
    by jschrod ( 172610 ) writes:
    
    OK, I bite.
    Your URL leads to a domain parking page. Google search for Wikistick didn't bring results on the first page either. AFAIK, full Wikipedia (text and images) is too large for a USB stick.
    What did you want to tell us?
    - Re: (Score:2)
      
      by gbjbaanb ( 229885 ) writes:
      
      Is it? [mygeeknewz.com] I read the DB is 2.9Gb which is all he's taken. You don't *need* the images, but I imagine they'd take a fair amount of space!
- Re: (Score:2)
  
  by Ant P. ( 974313 ) writes:
  
  Please note that you need a properly configured MySQL server in order to efficiently run a local copy of Wikipedia, which must have at least 8GB of RAM.
  In other words, your post is a complete waste of space given that he said in TFS that this is for a laptop.
Mass inserts into mysql... (Score:4, Informative)

by Splab ( 574204 ) writes: on Tuesday August 14, 2007 @01:32AM (#20221557)

is very very slow when you do it on a normal installation, the reason is MySQL comes with a "be nice to people who don't know what they are doing" setup. Go into the my.cnf and find the buffer settings, crank them up and restart the server. It can really do a lot (especially if you are running InnoDB which you of course are since MyISAM isn't a proper database).

- Re: (Score:2)
  
  by Ant P. ( 974313 ) writes:
  
  Neither is SQLite, but that doesn't mean nobody should use it ever.
What?? (Score:5, Funny)

by icydog ( 923695 ) writes: on Tuesday August 14, 2007 @01:48AM (#20221617) Homepage

TFA is:

1. Not a thinly-veiled attempt to advertise a crappy product
2. Not bashing Microsoft
3. Not about somebody who is trolling open-source (i.e. SCO)
4. Not about Bush taking away all our rights and ending freedom
5. Not about voting fraud and the end of democracy/America/the world
6. Not decrying Vista DRM and its ties to the MAFIAA
7. Posted on Slashdot

Furthermore, TFA is interesting and informative.

Am I in heaven?

- Re:What?? (Score:5, Funny)
  
  by Pollardito ( 781263 ) writes: on Tuesday August 14, 2007 @09:16AM (#20223779)
  
  it'll get posted again tomorrow just to maintain expectations
  
The Point? (Score:2)

by photomonkey ( 987563 ) writes:

I know that not everyone has a permanent connection to the net everywhere they go, but what is the point of storing a local copy of Wikipedia?

The beauty of it is that it is online and always up-to-date (wrong, or less wrong).

Trying to capture it locally seems to me to be like trying to print The Internet. By the time it's done spooling, it's out of date.

If it's an academic project, that's really cool, but I don't see a practical point to it.
- Re:The Point? (Score:5, Insightful)
  
  by Mr. Roadkill ( 731328 ) writes: on Tuesday August 14, 2007 @03:26AM (#20222067)
  
  I know that not everyone has a permanent connection to the net everywhere they go, but what is the point of storing a local copy of Wikipedia?
  Ummm... I think the whole point is, as you've pointed out, that not everyone has a permanent connection to the net everywhere they go. Or maybe they don't have access to everything they'd like even if they *do* have net access everywhere, or want to pay extravagant data rates while out and about.
  
  Joe has all-you-can-eat broadband at home, or an understanding employer with a fat pipe, and spends two hours each day on the train. Two and a half gig per month (and let's face it, you probably don't want to update it more frequently than that) and he's got probably half his reading material sorted out.
  
  Wang lives in Buttfuckistan, a fictional country with totalitarian leanings with too many real-world counterparts. The Great Firewall of Buttfuckistan (i.e. squidguard, under the control of Buttfuckistan Telecom, and settings in the routers to drop non-port-80 traffic half the time) makes it impossible to reliably access Wikipedia from inside their borders, which is a great shame because the entry on Buttfuckistan is particularly unflattering. Once a month, Joe sticks a DVD with five minutes from an old re-run of Friends and an encrypted dump of Wikipedia in an airmail envelope and sends it to Wang.
  
  Mary is still at secondary school, and her particular school has wifi access for students who are encouraged to purchase their own laptops, but since the local pastor discovered http://en.wikipedia.org/wiki/Image:Dream_of_the_fishermans_wife_hokusai.jpg [wikipedia.org] they've been forced to add wikipedia to the school's blocklist. Which is a pity, because it's a great first-approximation source for material or research directions, but there you go. Mary can make a local copy through her home broadband connection, and can access it locally on her laptop wherever she goes - even at school, or church. Bill, Jillian and Mungo (the pastor's son) find out about this, and now all four of them take it in turns to make the copy each month, sharing the bandwidth costs. Their friends Harry and Sally, who don't have broadband but are great friends of the other four, also get copies... and there are plans to distribute the copies further, as a kind of teenage grass-roots knowledge-sharing and social-justice effort.
  
  Still can't see the point?
  
- Re: (Score:2)
  
  by Riktov ( 632 ) writes:
  
  >>
  The beauty of it is that it is online and always up-to-date (wrong, or less wrong).
  
  Trying to capture it locally seems to me to be like trying to print The Internet. By the time it's done spooling, it's out of date.
  >>
  
  Sure, what's the point of reading an old version of the history of the Battle of Hastings, or the technical specifications of the P-51 Mustang, or the characteristics of a dominant seventh chord? After six months, it's completely obsolete and worthless, right?
What about moulin? (Score:2)

by maubp ( 303462 ) writes:

How is this different to moulin which is a fully interactive, offline version of the entire Wikipedia (without pictures) on a CD-ROM:

http://moulinwiki.org/l/en/ [moulinwiki.org]
- - Re: (Score:2)
    
    by maubp ( 303462 ) writes:
    
    Now I've read both articles:
    This guy's work required about 3GB for the compressed Wikipedia data dump (split up into compressed chunks using bzip2recover), plus python, perl, a little database library (xapian) and a web server (Django). He seems to be working in English only, and doesn't seem to provide a "why" or who this might be useful to.
    Moulin has a concrete aim in mind, they are starting with the much smaller French version of Wikipedia, and have built a CD-ROM sized offline viewer for released in
    - Re: (Score:2)
      
      by LordSnooty ( 853791 ) writes:
      
      Is Moulin anything to do with Kiwix [kiwix.org]? Cos these guys were also building an offline WP viewer, though only featuring 2000 important articles. Dev seems to have stopped now, a pity as it was a nice package with an excellent page viewer. Ideal for slapping on a laptop and providing something to read when you're away from net.
    - Re: (Score:2)
      
      by Peganthyrus ( 713645 ) writes:
      
      and doesn't seem to provide a "why" or who this might be useful to
      
      Um, anyone who wants to have the entire English version of Wikipedia on their local machine, for those times when they're away from the net?
      
      People who "would love to have Wikipedia on their laptop, since this would allow them to instantly check for things they want regardless of their location (business trips, hotels, etc). Others simply don't have an Internet connection - or they don't want to dial up one every time they need to check someth
...or the HTML export feature? (Score:2)

by georgewilliamherbert ( 211790 ) writes:

There's a one-button (for admins) export-the-whole-wiki-as-html feature in modern MediaWiki software installs...

But hey, two days and a few hundred lines of code is cool. You geek (verb). If we always took the easy way out we'd be using Windows and have committed suicide long ago.
can we get a PSP version of it? (Score:4, Interesting)

by mu22le ( 766735 ) writes: on Tuesday August 14, 2007 @04:28AM (#20222305) Journal

A PSP is very portable (fits in your sweater/backpack), hackable, and has up to 8Gb of storage. I have been dreaming for a year about porting wikipedia to it. Unfortunately I'm not familiar with the kind of programming needed and I could never find the time...

- better yet, a DS version (Score:4, Informative)
  
  by tepples ( 727027 ) writes: <tepples.gmail@com> on Tuesday August 14, 2007 @07:23AM (#20222965) Homepage Journal
  
  A PSP is very portable (fits in your sweater/backpack), hackable
  You have to buy a used PSP to be sure that you can hack it. New ones are likely to come with firmware version 3.51 or later, which is not cracked as of August 2007. The Nintendo DS, on the other hand, had its last major firmware update in September 2005 and is still cracked, with SLOT-1 modchips available at Wal-Mart for $30.
  and has up to 8Gb of storage
  So does a CompactFlash card in a GBA Movie Player in a Nintendo DS. It's a pity that the SLOT-1 adapters for DS haven't been shown to be compatible with SDHC.
  
- Or a PalmOS version of it? (Score:2)
  
  by Uninvited Guest ( 237316 ) writes:
  
  Good thinking. I was just wondering the same thing about my PalmOS PDA. It has plenty of storage available. I wonder if the existing Python port [sourceforge.net] would be sufficiently powerful to run this.
  - wikipedia for iPod is already here... (Score:2)
    
    by mu22le ( 766735 ) writes:
    
    The thing is, a portable version of wikipedia has been already developed:
    http://encyclopodia.sourceforge.net/en/index.html [sourceforge.net]
    
    for the iPod; also, the Encyclopodia Ebook format (basically indexed, bzip2-compressed articles or blocks) is far better suited to portable devices.
    Now if any PSP/DS/Palm developer is reading this...
There's a bug in TFA: Missing articles. (Score:5, Insightful)

by dannycim ( 442761 ) writes: on Tuesday August 14, 2007 @05:25AM (#20222493)

There's a serious problem with the article's way of treating the data that I didn't see addressed.

The wikipedia database file is one large bzip2'ed XML file which the author splits into blocks of 900k (bzip2's natural blocking) which he then parses for the "title" and "text" XML tags.

The problem with that approach is that some of these tags may well end up being split over block boundaries, so some articles risk being missed. EG:

END-OF-BLOCK: blablablabla...blabla[/text][othertag][ti

START-OF-NEXT-BLOCK: tle][sometag]blablablablabla...

So searching for "[title]" in both blocks separately like TFA does will fail for one article.

(I've used square brackets instead of lessthans and greaterthans because slashdot won't let me use them.)
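
The concern above can be reproduced in a few lines. This is a minimal sketch, not the code from TFA: the two strings stand in for adjacent decompressed blocks, the split point is chosen by hand to cut a title tag in half, and the function names are hypothetical. The first function scans each block on its own; the second carries the unfinished tail of one block into the next.

```python
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>")
OPEN_TAG = "<title>"


def titles_per_block(blocks):
    """Scan every block independently (the approach being questioned)."""
    found = []
    for block in blocks:
        found.extend(TITLE_RE.findall(block))
    return found


def titles_with_carry(blocks):
    """Scan blocks in order, carrying a possibly unfinished tag forward."""
    found = []
    carry = ""
    for block in blocks:
        data = carry + block
        last_end = 0
        for match in TITLE_RE.finditer(data):
            found.append(match.group(1))
            last_end = match.end()
        tail = data[last_end:]
        start = tail.rfind(OPEN_TAG)
        if start != -1:
            # Opening tag seen, closing tag must be in a later block.
            carry = tail[start:]
        else:
            # The opening tag itself may have been cut in half.
            carry = tail[-(len(OPEN_TAG) - 1):]
    return found


# Hand-made example: the second title tag straddles the block boundary.
blocks = [
    "<title>Coffee</title><text>blabla</text><ti",
    "tle>Tea</title><text>blabla</text>",
]
print(titles_per_block(blocks))   # ['Coffee']
print(titles_with_carry(blocks))  # ['Coffee', 'Tea']
```

Whether real bzip2 block boundaries ever fall inside a tag is a separate question; the sketch only shows what happens if they do.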

- Re: (Score:2)
  
  by dannycim ( 442761 ) writes:
  
  Oh, and I forgot: Articles' text which also end up over block boundaries will appear truncated.
  - - Re: (Score:2)
      
      by dannycim ( 442761 ) writes:
      
      I'm only seeing one call to "ShowTopic" in show.pl, which does the decompression of a block. I thought about the possibility that bzip2 would be unlikely to split words, but there's still the possibility that the title and text tags from one set could be split over two adjacent blocks.
      
      Anyway, gotta go to work. When I come back, I'll do some more in-depth sleuthing. :)
- Re:There's a bug in TFA: Missing articles. (Score:4, Informative)
  
  by ttsiod ( 881575 ) writes: on Tuesday August 14, 2007 @12:53PM (#20226505) Homepage
  
  (Re-post: for some reason the response I sent some hours ago didn't appear) No, actually there is no bug. If you read the contents of the 'show.pl' script, you'll see that it adapts to a missing '</text>' by reading from the next volume - the next recxxx...bz2.
  As for the title, what you describe can't happen because of a fortunate side-effect: when compressing, bzip2 (like other compressors) looks for previous appearances of a token (in this case, '<title>') and codes a reference to it (instead of the full text) to save space. Since "text" and "title" appear all the time in these blocks (at least once for each article), they will NOT be split - they will be encoded as "references", and therefore, what you describe shouldn't happen (I hope :-)
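
The "read from the next volume" behaviour described for show.pl can be sketched as follows. This is an illustration in Python rather than the author's Perl, under several assumptions: the volumes are held in memory as bz2-compressed byte strings instead of recxxx...bz2 files, the markup uses a bare `<text>` tag without attributes, and the function name `read_article_text` is made up.

```python
import bz2


def read_article_text(volumes, index, title):
    """Return the text of `title`, starting the search in volumes[index].

    If the closing </text> tag is missing from the current volume, keep
    appending the following volumes until it shows up or none are left.
    """
    data = bz2.decompress(volumes[index])
    marker = b"<title>" + title.encode("utf-8") + b"</title>"
    start = data.find(marker)
    if start == -1:
        return None
    data = data[start:]
    nxt = index + 1
    while b"</text>" not in data and nxt < len(volumes):
        data += bz2.decompress(volumes[nxt])
        nxt += 1
    begin = data.find(b"<text>")
    end = data.find(b"</text>")
    if begin == -1 or end == -1:
        return None
    # Decode only after joining, so a multi-byte character cut at a
    # volume boundary is reassembled before decoding.
    return data[begin + len(b"<text>"):end].decode("utf-8")


# Hand-made two-article "dump", cut in the middle of the second article.
xml = (
    "<page><title>Coffee</title><text>Coffee is a brewed drink.</text></page>"
    "<page><title>Tea</title><text>Tea is an aromatic beverage.</text></page>"
)
cut = xml.index("aromatic")
volumes = [
    bz2.compress(xml[:cut].encode("utf-8")),
    bz2.compress(xml[cut:].encode("utf-8")),
]
print(read_article_text(volumes, 0, "Tea"))
```

This covers the truncated-text case only; it does nothing for a title tag that is itself split across volumes.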
  
Hook it in to your desktop search... or... (Score:2)

by argent ( 18001 ) writes:

I didn't know about the wikipedia raw database, or I'd probably have done something like this myself, and hooked it into the UNIX "locate" db, or Spotlight, or maybe...

$ man w locate
GNU Locate
From Wikipedia, the free encyclopedia (redirected from Locate)
...
This software-related article is a stub. You can help Wikipedia by expanding it.
sdict? (Score:2)

by jimmyfergus ( 726978 ) writes:

Anyone with experience of sdict [sdict.com]?
They offer a dictionary reader for various systems, including portable devices, and dictionaries including Wikipedia.
Unfortunately their Wikipedia dict is old (January), but it seems like a good approach for laptops or other small devices. When I get an 8Gb SDHC I'm going to try it on my Nokia N800.
Not New (Score:2)

by Baavgai ( 598847 ) writes:

While interesting, it's certainly reinventing the wheel. There are lots of methods for doing this found on the site itself ( http://en.wikipedia.org/wiki/Wikipedia:Database_download [wikipedia.org] ) including static content already marked up.

Also, as others have noted, his chopping the file into chunks means you're going to lose at least one article per chunk.

I'd implemented this with a compressed file system and maybe some symlinks. Happily, the static content is already there for the taking. Some find and grep o
Gears? (Score:2)

by pragma_x ( 644215 ) writes:

I didn't see any posts on this so I thought I'd bring it up. I think the author took the long way around.

The author did some nifty hacking that resulted in the following stack of dependencies:

* Perl 5.8.5
* Python 2.5
* PHP 5.2.1
* Xapian 1.0.2
* Django 0.96

He cited not wanting to use a RDBMS since he's not writing to the database, just reading. I can give him that,
Now get it on a mobile phone (Score:2)

by Cato ( 8296 ) writes:

I live in an area with fairly bad mobile signal - I'm always trying to look things up on Wikipedia but finding I can't. Fortunately my Treo 680 smartphone can take 8GB SDHC cards (http://en.wikipedia.org/wiki/Treo_680 [wikipedia.org]), so I could fit this on with room to spare for MP3s and photos, and future growth of Wikipedia. Very tempting, though I'd need to port it to something like Lua and GCC - obviously the porting would be fairly trivial by the time Palm releases its Linux-based Treos in early 2008...
- Re:Uh.... (Score:5, Interesting)
  
  by dhwebb ( 526291 ) writes: on Monday August 13, 2007 @11:40PM (#20220837) Homepage Journal
  
  Programming something new to some people is like playing a video game. I love programming useless things just for the challenge. People who don't understand that have never had a true love for programming.
  
  - Re: (Score:2, Insightful)
    
    by Tablizer ( 95088 ) writes:
    
    I love programming useless things just for the challenge.
    
    Have you ever worked on a project called "Clippey", by chance?
    - Re:Uh.... (Score:5, Funny)
      
      by Gazzonyx ( 982402 ) writes: <(moc.liamg) (ta) (grebnevol.ttocs)> on Tuesday August 14, 2007 @12:53AM (#20221377)
      
      I love programming useless things just for the challenge.
      
      Have you ever worked on a project called "Clippey", by chance?
      
      No, he said he has a love for programming; not a seething hatred for users. Besides, everyone knows programmers only hate admins. ;) On behalf of the programmers, I'd like to say that this isn't true we love our admins. Who else makes sure that our connections*&#^$: Connection Reset By Peer
      
      - Re: (Score:2)
        
        by LittleBigLui ( 304739 ) writes:
        
        No, he said he has a love for programming; not a seething hatred for users.
        
        As if that was possible.
        
        Re: (Score:2)
        
        by Gazzonyx ( 982402 ) writes:
        
        No, he said he has a love for programming; not a seething hatred for users.
        
        As if that was possible.
        Touche, my good man, touche.
  - Re:Uh.... (Score:4, Informative)
    
    by stephanruby ( 542433 ) writes: on Tuesday August 14, 2007 @12:38AM (#20221265)
    
    Programming something new to some people is like playing a video game.
    Speaking of which, http://www.pyweek.org/ [pyweek.org] is coming up this first week of September. It's time to dust off that python book (or borrow one from someone) and do whatever you have to do to get some days off that week.
    
  - I know the feeling (Score:5, Insightful)
    
    by aepervius ( 535155 ) writes: on Tuesday August 14, 2007 @01:30AM (#20221543)
    
    They tell you that their hobby is painting/music/walking/repairing old cars/gardening/making scale models, etc., and they seem to think that their hobbies are perfectly acceptable. But as soon as you say you like to program stuff, they don't understand how this could be a hobby. They mostly fail to recognize that every one of us has something in common: the joy of the act of creation. The fact that our hobby entails creating something immaterial and full of "logic" does not matter. It is still a joy.
    
    - - Re: (Score:2)
        
        by fotbr ( 855184 ) writes:
        
        Heh, I hate walking on the beach, skiing, and rock climbing. Give me sailing, woodworking, and metalsmithing, amongst many others.
        
        Photography I do enjoy, but I have no delusions that everyone and their brother wants to see my photos, so I don't slap them all over Flickr. On the other hand, my house is decorated entirely with photos I've taken, since *I* like them.
        
        Re: (Score:2)
        
        by orgelspieler ( 865795 ) writes:
        
        Here you go!
        
        long walk on a beach [flickr.com]
        skiing [flickr.com]
        rock climbing [flickr.com]
        photography [flickr.com]
        photography on a beach [flickr.com]
        rock climbing on a beach [flickr.com]
        photographing skiing (almost) [flickr.com]
        I couldn't find one of somebody photographing while skiing past a rock-climber on a beach, sorry.
  - Re: (Score:2)
    
    by Jugalator ( 259273 ) writes:
    
    First, I don't think this tool was useless -- it was a quick way of achieving his goal of off-line Wikipedia browsing. Second, when I program I prefer to make something useful to me, and I don't think it has to do with a lack of passion for programming. It's just that I'd rather see my time come to good use, even if I enjoy the process by itself.
  - Re: (Score:3, Interesting)
    
    by fotbr ( 855184 ) writes:
    
    I was that way, once. Then other hobbies came along, and now I rarely do any programming that's not work related.
    
    It's funny how time changes you.
- Re: (Score:2)
  
  by hobbesmaster ( 592205 ) writes:
  
  So you can settle trivial arguments with your friends when away from an internet connection, duh!
  
  (Or to always have something to read on your laptop while traveling - this is what I would use it for)
  - Re: (Score:3, Funny)
    
    by Gazzonyx ( 982402 ) writes:
    
    So you can settle trivial arguments with your friends when away from an internet connection, duh!
    
    (Or to always have something to read on your laptop while traveling - this is what I would use it for)
    I bet you're quite the ladies' man, huh?
    Sorry, I couldn't resist!
- local resource, better interface (Score:2)
  
  by Zork the Almighty ( 599344 ) writes:
  
  Because it might be useful to have something stored locally. I travel a lot with my laptop and I would like this. I would also appreciate the convenience of not having to fire up a web browser for Wikipedia. You can search articles from the command line. You could also potentially write a better search feature, i.e., bolt on some code to combine and summarize multiple related articles. The approach the guy used (a bunch of small bz2 files) is interesting and potentially useful. I'd say this was one of t
- Re: (Score:2, Insightful)
  
  by Anonymous Coward writes:
  
  And that doesn't happen offline? Only naive people like you need to be worried about reading Wikipedia.
  
  There are bastards of every academic, social, and financial background.
- Re:Just hope you don't get an effed image. (Score:5, Insightful)
  
  by Tacvek ( 948259 ) writes: on Tuesday August 14, 2007 @12:14AM (#20221083) Journal
  
  My very serious question to you is: how much better do you think things are at a "real" encyclopedia? They have many of the same problems, but they are just not public. "Real" encyclopedias can be just as inaccurate as Wikipedia on many articles. For a quick first reference, Wikipedia is an ideal tool. Just be sure to take things with a grain of salt if you are not checking the sources for further information. Guess what though, the same applies to "real" encyclopedias too. One difference is that with "real" encyclopedias, you always lack revision information, and you often lack information about the sources used by the editors. (Some encyclopedias are better than others in that respect.)
  
  - Re:Just hope you don't get an effed image. (Score:4, Funny)
    
    by gad_zuki! ( 70830 ) writes: on Tuesday August 14, 2007 @01:04AM (#20221429)
    
    Yes, the paper encyclopedias are missing all the anime trivia. Christ, it's embarrassing to see "references in pop culture" sections which just spell out every geeky guy stereotype. I don't know why those people don't get banned. Everything in existence has an anime reference. That is unsettling.
    
    - Re: (Score:2)
      
      by AaronLawrence ( 600990 ) * writes:
      
      Well I agree, and I think there is a (near?) consensus in WP policy that such things are not useful, unless they are particularly notable, and should be deleted.
  - Re: (Score:2)
    
    by Jugalator ( 259273 ) writes:
    
    Unfortunately for Wikipedia, the quality or lack of it in competing encyclopedias does not resolve the problems in Wikipedia. I hope Wikipedia can work on these issues because I am seeing some of it too. I'm also seeing article rot being quite common, in that old articles deteriorate, and not really from a lack of good will either. Someone discusses the problem in a blog here: http://nonbovine-ruminations.blogspot.com/2007/02/where-are-stable-versions.html [blogspot.com]
- Re: (Score:2)
  
  by Bombula ( 670389 ) writes:
  
  It might defend on the topic/field in question. The articles you reference seem to be focused on tech stuff. I use wikipedia primarily for socioeconomic reference material, and find it in general to be pretty solid. There are places where the depth is limited, but it's definitely my first-reach resource as long as I have an internet connection - mainly because many of the specific things I'm after might not be in a general encyclopedia like Britannica - intertemporal equilibrium, hedonic regression, Edge
  - Re: (Score:2)
    
    by Bombula ( 670389 ) writes:
    
    Yikes, defend = depend
- Re: (Score:3, Funny)
  
  by ZzzzSleep ( 606571 ) writes:
  
  Blatantly stolen from David Morgan-Mar [livejournal.com].
  In many of the more relaxed corners of the Outer Eastern Rim of the Internet, Wikipedia has already supplanted the great Encyclopaedia Britannica as the standard repository of all knowledge and wisdom, for though it has many omissions and contains much that is apocryphal, or at least wildly inaccurate, it scores over the older, more pedestrian work in two important respects.
  
  First, it is slightly cheaper; and secondly it has the words "anyone can edit" inscribed in l
- Re:2X (Score:5, Informative)
  
  by Brian Gordon ( 987471 ) writes: on Tuesday August 14, 2007 @12:24AM (#20221161)
  
  Ahaha, 2.9GB? That's the text alone. Images will net you more than 200GB [wikimedia.org] more. And yes, you do need a LAMP/WAMP setup and a working MediaWiki, but it wouldn't take 'days'; it would take a few hours max. Also, is this guy aware that Wikipedia is available on DVD [wikipedia.org] already?
  
  - Re:2X (Score:5, Informative)
    
    by TubeSteak ( 669689 ) writes: on Tuesday August 14, 2007 @01:07AM (#20221457) Journal
    
    Also is this guy aware that wikipedia is available on DVD already?
    Are you aware that the link you pointed to (1) is not the same thing as the link (2) the author pointed to?
    (1) http://schools-wikipedia.org/ [schools-wikipedia.org]
    (2) http://download.wikimedia.org/enwiki/latest/ [wikimedia.org]
    
    (1) is 4625 articles hand-picked for school-age children, hence the website name
    (2) is a straight dump of Wikipedia
    
    Just imagine my surprise when the schools-wikipedia website didn't have the wiki article on Goatse!
    
- - In what namespace? (Score:2)
    
    by tepples ( 727027 ) writes:
    
    Wikipedia seems the best place for the author's "how to download and use offline".
    No Original Research.
    In the article namespace. Research about Wikipedia appears to be encouraged in the Wikipedia: namespace [wikipedia.org].
- Re: (Score:3, Informative)
  
  by pla ( 258480 ) writes:
  
  Do you mean searching takes days, or loading? Searching should be quick if you index the words. If you are duplicating a bunch of local clones of wiki, then simply copy down the raw MySQL table data files rather than reload from delimited files etc. (One needs to make sure their version of MySQL is compatible with the table file format.)
  
  I suspect the former, plus creating the index, plus the not inconsiderable overhead of running an SQL server.
  
  DBs have their place. For a "real" Wiki, or more generally
  - - Re: (Score:2)
      
      by pla ( 258480 ) writes:
      
      Compared to bloated GUIs and fat device drivers, most database engine overhead is relatively minor.
      
      I certainly wouldn't go that far. In the memory it takes to tolerably run a Wiki starting with a real dump, you could easily run three or four entire virtual systems. A basic XP or RedHat/Gnome system runs decently in 256MB. Import a 2.5GB BZipped Wiki with MySQL limited to 256MB and tell me how responsive it feels.
      
      I suspect the author just didn't want to bother to tune MySQL.
      
      Nor sho
- Re: (Score:2)
  
  by mikeboone ( 163222 ) writes:
  
  I've been using Xapian too, and it works great. Mod Parent Up.
- Re: (Score:2)
  
  by Blackknight ( 25168 ) writes:
  
  Once data is imported, MySQL is usually fast. Of course SQLite would be even faster, and no database server is needed; it's just a regular file. For a project like this SQLite would be a perfect fit. I actually wrote my entire site using it.
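
A minimal sketch of the SQLite suggestion in the comment above, assuming Python's standard `sqlite3` module. The table name `title_words`, the helpers `build_index` and `search`, and the sample titles are all hypothetical, and this is not the Xapian-based index the article describes. It only illustrates title-word lookup against a database that lives in one file with no server process.

```python
import sqlite3


def build_index(conn, titles):
    """Store one (word, title) row per title word. Schema is hypothetical."""
    conn.execute("CREATE TABLE IF NOT EXISTS title_words (word TEXT, title TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_word ON title_words (word)")
    rows = [(word.lower(), title) for title in titles for word in title.split()]
    conn.executemany("INSERT INTO title_words VALUES (?, ?)", rows)
    conn.commit()


def search(conn, query):
    """Return titles ordered by how many distinct query words they contain."""
    words = [word.lower() for word in query.split()]
    if not words:
        return []
    placeholders = ",".join("?" * len(words))
    sql = (
        "SELECT title, COUNT(DISTINCT word) AS hits FROM title_words "
        f"WHERE word IN ({placeholders}) "
        "GROUP BY title ORDER BY hits DESC, title ASC"
    )
    return [title for title, _hits in conn.execute(sql, words)]


# ":memory:" keeps the demo self-contained; a real setup would pass a file path.
conn = sqlite3.connect(":memory:")
build_index(conn, ["History of coffee", "Coffee production", "History of tea"])
results = search(conn, "coffee history")
print(results)
```

Ranking here is just a count of matching title words, which is far cruder than a real search engine's probabilistic scoring, but it shows the "multiple candidates, best match first" idea with nothing beyond the standard library.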
