The Internet Technology

Internet Archive Opens Crawler Code Under LGPL 186

ramakant writes: "It looks like the Internet Archive, which hosts the infamous Wayback Machine, has opened its newest in-development crawler code under the LGPL. From the announcement: 'Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess. Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.'"

  • Mr peabody! (Score:5, Funny)

    by Anonymous Coward on Wednesday January 07, 2004 @10:40AM (#7902902)
    They've open sourced your wayback machine! Now you've lost the monopoly!
    • The Way Back machine - one way that we as a species of tech savvy gurus can travel back in time...now, if only they could figure out how to reverse the technology and travel forward ;-)

      no sig needed to make this message unique
      • by Anonymous Coward
        I don't know about you but I have no problem traveling forward in time. It is getting back that is the real trick.
    • They've open sourced your wayback machine! Now you've lost the monopoly!

      Mr. Peabody never makes a mistake. Didn't you learn anything, Sherman? It was the right thing to do.

      Trivia: I bought the season 1 DVD of Rocky and Bullwinkle and saw the original spelling was 'WAYBAC'

  • gpl vs. lgpl? (Score:3, Interesting)

    by Anonymous Coward on Wednesday January 07, 2004 @10:41AM (#7902906)
    could someone summarize the differences?

    fp?
    • Re:gpl vs. lgpl? (Score:2, Insightful)

      by Anonymous Coward
      this ain't OT. The guy asked what the difference was between the GPL and LGPL. LGPL being the license the wayback code is being placed under, the opening of the code being the topic of discussion. Therefore, the post couldn't be any more on-topic.

      For chrissakes moderators! It says that the code is LGPL in the freakin' article HEADLINE!! We already have enough trouble with people not RTFA, an occasional someone who didn't read the submitter's post, and now we have moderators not RTFH to deal with too!!
    • by Anonymous Coward
      One is communist, the other is socialist.
    • I'm quite certain that people will correct me (at length) if I'm wrong, but here goes.

      The GPL says that you can use source and code any way that you want, but if you release modified versions, you must release the modified source under the GPL.

      The LGPL is intended for libraries that are released under the GPL. It says that commercial and other non-GPL projects can use this library without becoming GPL, but that changes to the library itself must be released under the LGPL.

      LGPL is generally considered a lighte
    • From the GNU LGPL Preamble [gnu.org]:

      Most GNU software, including some libraries, is covered by the ordinary GNU General Public License. This license, the GNU Lesser General Public License, applies to certain designated libraries, and is quite different from the ordinary General Public License. We use this license for certain libraries in order to permit linking those libraries into non-free programs.

      When a program is linked with a library, whether statically or using a shared library, the combination of the two

  • You mean works of art like this?

    B1FF#S K3WL H0M3 PAG3!!! [panix.com]
  • In case of /.ing... (Score:4, Informative)

    by Dave2 Wickham ( 600202 ) * on Wednesday January 07, 2004 @10:41AM (#7902915) Journal
    The source download is available on sourceforge [sourceforge.net].

    I doubt it'll get slashdotted, but you never know...
  • Then maybe (Score:4, Insightful)

    by caston ( 711568 ) on Wednesday January 07, 2004 @10:42AM (#7902921)
    OSDN can decide to open source source forge...
  • Oldest /. entry (Score:5, Interesting)

    by Anonymous Coward on Wednesday January 07, 2004 @10:44AM (#7902945)
    Look, ma - no trolls!! But anti-MS comments in da hizzouse!! [archive.org]

    I much prefer the current /.
  • score (Score:5, Funny)

    by TedCheshireAcad ( 311748 ) <ted AT fc DOT rit DOT edu> on Wednesday January 07, 2004 @10:45AM (#7902954) Homepage
    Score! Now I can run my own wayback machine!

    I only have a 30G hard drive though, what do you guys think, bzip should take care of it?
    • Re:score (Score:5, Funny)

      by bamf ( 212 ) on Wednesday January 07, 2004 @10:49AM (#7902986)
      If you limit yourself to only archiving the useful parts of the interweb, you should be able to fit it all on floppy disk or two.
    • That's a great idea!
      I'm sure you can find 4-5 terabytes of drive space lying around somewhere!
      I have about 60GB I can donate! =P
    • You can't. SCO now claims ownership of every line of GPL code. Barely stretching it, the Internet Archive (and thus Internet itself) can be seen as SCO's IP as "derivative work". You'll send a $699 check to the order of D. McBride, Salt Lake City UT 84101 every time you connect to your ISP. Ka-ching!
    • Re:score (Score:5, Interesting)

      by corebreech ( 469871 ) on Wednesday January 07, 2004 @12:14PM (#7903786) Journal
      I'll use it if you promise not to delete shit that doesn't hew to your ideology.

      That's what really sucks about the Wayback Machine.

      Ever try reading articles from the aftermath of 9/11? It's a great big hole, so much stuff has been deleted.
  • by tcopeland ( 32225 ) * <tom@NoSPaM.thomasleecopeland.com> on Wednesday January 07, 2004 @10:47AM (#7902968) Homepage
    ...some unused variables [infoether.com] and such-like in there, though, as reported by PMD [sf.net].
  • From their FAQ: if you are comfortable grabbing code directly from CVS, wrestling with incomplete documentation, and running into undocumented limitations, would you want to use the current software.
    Undocumented limitations? That sounds like a lot of fun!
  • by kyoko21 ( 198413 ) on Wednesday January 07, 2004 @10:50AM (#7902997)
    Nothing like crawling for old, recycled, and dead torrents.
  • This is great news (Score:2, Informative)

    by CompWerks ( 684874 )
    Open source that handles over 300tb of data!
  • Gordon Mohr (Score:4, Informative)

    by Orasis ( 23315 ) on Wednesday January 07, 2004 @10:50AM (#7903007)
    Congrats Gojomo!

    This project was written by the brains behind bitzi [bitzi.com] and some really cool P2P [open-content.net] stuff [yahoo.com].

    He's one of those guys who's going to be working on important stuff for years to come.
  • What about... (Score:4, Insightful)

    by herrvinny ( 698679 ) on Wednesday January 07, 2004 @10:51AM (#7903012)
    Heritrix (sometimes spelled heretrix , or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess.

    I know some grammar nazi is going to see this, so I might as well get it first. What about heretic: [m-w.com] one who dissents from an accepted belief or doctrine.

  • Beneath this article I noticed this fortune cookie:

    "Insanity is hereditary. You get it from your kids. "

  • by Dorf on Perl ( 738169 ) on Wednesday January 07, 2004 @10:54AM (#7903036)
    This is a great step forward, I welcome our archiving overlords, etc. Right now when I want to share some of my history (the good stuff, natch) with my kids, I have to dig out an old, musty shoebox full of junk. When they want to share theirs with their kids, they'll just beam a URL into my grandkids' in-skull HUDs. While in their flying cars. "Oh look, here's another stupid post to Slashdot by Grandpa..."
  • Infamous? (Score:4, Interesting)

    by BitchAss ( 146906 ) on Wednesday January 07, 2004 @10:59AM (#7903081) Homepage
    the infamous Wayback Machine

    Why is it infamous? I haven't heard anything bad about it.
  • I think we /.'d sf.net...either that or it's conveniently not accessible right after I see it linked from slashdot.

  • Heritrix? (Score:3, Funny)

    by elgrinner ( 472922 ) on Wednesday January 07, 2004 @11:02AM (#7903109)
    Sounds a bit like Asterix' grandfather.
  • Uh? (Score:5, Funny)

    by Zog The Undeniable ( 632031 ) on Wednesday January 07, 2004 @11:02AM (#7903110)
    Heritrix (sometimes spelled heretrix , or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess.

    WTF is inheritess? I think we have recursive typos here...my head is going to explode!

    • Re:Uh? (Score:3, Informative)

      by gojomo ( 53369 )
      'Inheritess' is the female form of 'inheritor' -- 'someone who inherits' (female). AKA 'heiress'.
    • Re:Uh? (Score:2, Informative)

      by phiala ( 680649 )
      The OED online is my friend!

      As a confirmed sesquipedalian, and obsessive research-addict, how could I overlook the opportunity to learn new words? And of course, share my newfound knowledge with you all...

      The OED would like us all to know:
      heritrix, heretrix: A female heir or heritor; an heiress.
      heritress: An heiress, an inheritress.
      inheritress: A female inheritor; an heiress. (Less technical than inheritrix.)
      inheritrix: Latinized fem. of INHERITOR

      inheritess: not a word

      And there you have it, co

    • Inheritess is not a typo for inheritance. It means a female who inherits.

  • Old slashdot news (Score:5, Interesting)

    by AyeFly ( 242460 ) on Wednesday January 07, 2004 @11:04AM (#7903129)
    here is a slashdot story from wayback i just found.

    "IBM announces a 25 gigger

    Posted by Hemos on Wednesday November 11, @10:11AM
    from the why-i-could-put-3/4-my-cd-collection dept.
    Booker writes "So IBM announces a 25 gig hard drive... does the world need this yet? Unless this is in a RAID, would you really want to trust 25 gigs on a single drive? What would you use this for? 400+ hours of MP3s comes to mind... "
    Read More...
    64 comments"

    Just thought it was interesting to see, since we now have 200gig HDs
    • Just thought it was interesting to see, since we now have 200gig HDs

      Check your rear-view mirrors more closely... that's a 300GB drive passing you by (Maxtor 300GB Ultra ATA/133 [pricescan.com] for only ~$275-$290). Price is falling pretty nicely for them too (when they came out in September they were $350).

      Of course, we saw the same arguments that you quoted there when the 300GB drives came out... does the world need this yet? Unless this is in a RAID, would you really want to trust 300 gigs on a single drive? What w
  • by OpCode42 ( 253084 ) on Wednesday January 07, 2004 @11:04AM (#7903130) Homepage
    Just been looking at some slashdot pages from 1997... quote from the "Post your comments here!" form : "If you don't have anything worthwhile to say, don't say it. If people continue to abuse this feature, I will have to remove it."

    Oh how different things could have been... ;-)

    If the trolls had time machines... [archive.org]
  • by Rahga ( 13479 ) on Wednesday January 07, 2004 @11:06AM (#7903140) Journal
    Ever since the wayback machine started making waves, I'd guess about 2 years ago, I've noticed 2 things: There are far fewer updates of the archives, and it seems that the archive is regularly unable to keep up with the client load we impose on it.

    To be honest, I don't have a great answer for the second problem. The only thing that could help there is the passage of time and advancement of technology, really. For the first problem, though, perhaps a SETI-ish distributed "Heritrix" could help make regularly archiving all of these sites a manageable affair. IA sends marching orders out to the distributed volunteer network, each client downloads, compares MD5 of the pages with other clients, compresses them, and sends them back to a master archive. Sounds great in theory, at least at first, to me...

    Then again, would I do this, or even continue the project if I was in charge? No, I wouldn't. While, ideally, every page on the internet would be in XHTML, striking a major blow against signal:noise (hey, my own page is XHTML validated, how about yours?), the vast majority of time spidering is undoubtedly wasted on re-downloading several dozen kilobytes of dynamically generated junk surrounding the content on sites such as CNN.com... While it's a noble cause, it's also a futile one.
    • Ever since the wayback machine started making waves, I'd guess about 2 years ago, I've noticed 2 things: There are far fewer updates of the archives, and it seems that the archive is regularly unable to keep up with the client load we impose on it.

      I think that they possibly intentionally limit their bandwidth, so that it's faster to browse the real Web than them (because they don't want to become Google cache when a site is slashdotted, for example).

      (Although they only would if the page in question is

      • I thought the reason they don't get the pages for 6 months is because Alexa (in exchange for sponsorship) gets the exclusive rights to the archive for the first 6 months. I'm too lazy to look it up now, but I think I read that.
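
The "compare MD5 of the pages with other clients" step proposed at the top of this sub-thread can be sketched as a simple quorum check. This is only an illustration of the commenter's idea, not anything Heritrix or the Internet Archive is known to do; the function names, the client labels, and the `quorum` parameter are all hypothetical.

```python
import hashlib
from collections import Counter


def md5_hex(page: bytes) -> str:
    """Hex MD5 digest of a fetched page body."""
    return hashlib.md5(page).hexdigest()


def accept_by_quorum(submissions, quorum=2):
    """Return the page body that at least `quorum` clients agree on.

    `submissions` maps a (hypothetical) client id to the bytes that
    client downloaded for one URL. Returns None when no digest
    reaches the quorum.
    """
    if not submissions:
        return None
    digests = {client: md5_hex(body) for client, body in submissions.items()}
    digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes < quorum:
        return None
    # Hand back the body from any client that reported the winning digest.
    for client, body in submissions.items():
        if digests[client] == digest:
            return body
    return None
```

Pages whose bytes differ on every fetch (the "dynamically generated junk" objection in the same comment) would rarely reach quorum under a scheme like this, so some normalization step would presumably be needed first.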
  • by Mentifex ( 187202 ) on Wednesday January 07, 2004 @11:06AM (#7903147) Homepage Journal

    The Internet Archive [archive.org] serves the hidden purpose of preserving the AI source-code DNA of artificial Minds.
    Each AI Mind [virtualentity.com] leaves a source code trace of itself as it evolves and proliferates across the 'Net and the parsecs of nearby meatspace.
    Robot Minds [scn.org] will be able to look up their ancestors in the Internet archive, just as we humans do. However, when the Joint Stewardship of Earth by man and cyborg has arrived in the form of the Technological Singularity, [caltech.edu] robots will be able to resurrect their AI Mind ancestors and bring them back to alife from the Internet Archive.
    • And many a geek without a RL will achieve eternal life when their personality (as expressed through pointed comments), experiences (as expressed through pointless anecdotes) and knowledge (as expressed through worthless advice) and thus their consciousness and LIVING MIND ITSELF, is painstakingly put back together by the same future race which will unfreeze the richer geeks from their cryogenic deathsleeps, from the myriad holographic shreds on the archived internet.

      Think about it...

      Everything you've ever
  • I wonder how long it will be till we see a new site open using the code...
  • Redundancy? (Score:3, Interesting)

    by Anonymous Coward on Wednesday January 07, 2004 @11:11AM (#7903180)
    The Internet is huge. But get rid of all the redundancy and the size goes down by a huge factor. How many copies of the Linux kernel and distros are there? How many copies of Matrix Reloaded? Do an MD5 sum and store pointers in order to recreate the structure of the net, keeping only one copy of what is unique. Terabyte servers are cheap these days. Wouldn't need more than a few at the most to archive everything.
    • ...while there may be unique content, there's certainly not unique versions. I'm sure there's many different rips of Matrix Reloaded. First off, there's all the various screener / preview dvd / telesync / DVD releases.

      Then there's all the corrupted versions (a single unnoticeable bit error = different MD5). Different rips (Macrovision removed/not removed, inverse telecine, PAL/NTSC versions, different resizing (bicubic/bilinear/Lanczos3).

      Some made using XviD, some DivX, some WMV, different versions of the
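
Both sides of this exchange can be seen in a small content-addressed store: identical bytes are kept once and URLs become pointers, but a one-bit difference produces a different digest and therefore a second copy. A minimal sketch, assuming an in-memory dictionary stands in for the archive; the `DedupStore` class and the example URLs are hypothetical. MD5 is used only because the thread names it; it has known collision weaknesses, so a stronger hash such as SHA-256 would be the safer choice in practice.

```python
import hashlib


class DedupStore:
    """Keep one copy of each unique body; URLs point at digests."""

    def __init__(self):
        self.blobs = {}  # digest -> body bytes (stored once)
        self.index = {}  # url -> digest (the "pointer")

    def put(self, url: str, body: bytes) -> str:
        digest = hashlib.md5(body).hexdigest()
        self.blobs.setdefault(digest, body)  # no-op if already stored
        self.index[url] = digest
        return digest

    def get(self, url: str) -> bytes:
        return self.blobs[self.index[url]]


store = DedupStore()
kernel = b"pretend this is a kernel tarball"
store.put("http://mirror-a.example/linux.tar", kernel)
store.put("http://mirror-b.example/linux.tar", kernel)
unique_after_mirrors = len(store.blobs)  # both mirrors share one blob

# Flip a single bit: the digest changes, so a second blob is stored.
corrupted = bytes([kernel[0] ^ 0x01]) + kernel[1:]
store.put("http://mirror-c.example/linux.tar", corrupted)
unique_after_bitflip = len(store.blobs)
```

Exact-match hashing of this kind only removes byte-identical duplicates; the differing rips and encodes listed in the reply would each be stored separately.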
  • by turambar386 ( 254373 ) <turambar386.routergod@com> on Wednesday January 07, 2004 @11:25AM (#7903309) Homepage

    "Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations..."

    That is, unless the digital artifacts in question are, like Operation Clambake [xenu.net], opposed to rich and powerful sects. In which case, they are blocked [archive.org] by the Wayback machine after the Archive caves in to DMCA notices [yale.edu].



    • Not true... they are just dark archives.

      The content is still there it's just not available to the CURRENT generation.

      Future researchers and generations will still have this data.

      If you want the latest just go to xenu.net..

      For the record I support Brewster's and the Archive's position on this. It's hard to know who is more evil... the CoS or the anti-CoS folks ;)

      (quick answer... the CoS is pure evil! ;)

      I've had a few fights with the CoS myself:

      http://www.peerfear.org/rss/permalink/2002/12/14/103990
    • In which case, they are blocked by the Wayback machine after the Archive caves in to DMCA notices.

      As upsetting as this is, I don't think it's fair to blame the Wayback Machine for this. They have to protect their own interests first to keep the service going at all. Becoming a martyr in a costly legal battle for political ideals may not fit into that. Companies don't have the freedom or flexibility of individuals, and this is the same reaction that nearly every other business and organisation wou

  • by British ( 51765 ) <british1500@gmail.com> on Wednesday January 07, 2004 @11:27AM (#7903323) Homepage Journal
    ...and archive.org tries to archive it? Will it go into an infinite loop, or just have 2 copies of the interweb?
  • "Heritrix" explained (Score:2, Informative)

    by skidoo2 ( 650483 )
    Sheesh. Let me put this one to bed before it snowballs into a big cloud of impenetrable Times New Roman.

    I'm tempted to shout, but I won't. Don't make me shout!

    "Heretrix" is a term most often seen in a genealogy context. It denotes a chick who is designated to inherit (or has already inherited) the estate of someone. Example sentence: "Captain Dork married Jack Dipstick's heretrix Gassy Lucy."

    In most cases the word "heretrix" connotes that there was something significant about the inherited estate, e.g.
  • finally! (Score:3, Funny)

    by badansible ( 630677 ) on Wednesday January 07, 2004 @11:30AM (#7903353)
    I will be able to look at that exciting gopher site everybody was talking about! Yes?
  • Guess this solves this guy's problem [slashdot.org].
  • How long? (Score:1, Redundant)

    by Raven42rac ( 448205 )
    How long until SCO claims that the code is theirs?
  • There's a huge number of open source web crawlers available already on SourceForge [sourceforge.net] and elsewhere. Anyone know the advantages and disadvantages of this one over the others?
    • Brewster Kahle and Alexa Internet are the real deal. This isn't some undergrad's CS-101 project, it's a tool designed from the very start to archive the entire web. And it does it on a regular basis. Even if there's a really good SourceForge project (you didn't cite any of them), Alexa's should be a first stop for anyone interested in the task.
  • I went to the GNU main site [ttp] to try and figure out what the LGPL was about, and no luck at all getting a coherent explanation.

    Wikipedia has a good explanation [wikipedia.org] (below), although I am confused as to why the Wayback Machine chose this particular licence since it seems to really be specifically for software libraries. Perhaps they meant the GFDL [wikipedia.org] (GNU Free Documentation License).

    P.S. You're allowed to copy all the stuff you want from Wikipedia; it's copylefted [wikipedia.org] with the GFDL itself! :)

    --- Wikipedia Article on LG

  • ...wayback inadvertently archives itself?!?!

    That reminds me... once I thought of googling for "google"... but I didn't since it, no doubt, would create a black hole or something!
  • by gojomo ( 53369 ) on Wednesday January 07, 2004 @12:26PM (#7903897) Homepage
    Heritrix is just a crawler for collecting web resources recursively, within some defined parameters -- it doesn't offer Internet Archive Wayback Machine (IA WM) functionality.

    FYI, there is a GPL'd web access tool that's very much like the IA WM, and even surpasses it in some ways: the NWA (Nordic Web Archive) Toolset 1.0 [nwa.nb.no]. It doesn't do crawling, but if you can coerce what you've crawled into its input format, it offers URL-based, date-based, and full-text search plus "back-in-time" viewing of an archive. (Check out their demo [nwa.nb.no], but remember it's only got a small number of pages from www.nb.no, so confine your searches to things like "Norway".)

    Heritrix release 0.2.0 was mainly a test of our new release procedure; we would not recommend the code for outside use yet. We use it for crawls of up to hundreds of sites, taking a week or more to complete, but it still requires expert attention to crawl well.

    We intend to improve its stability and scalability until it is capable of web-scale crawls -- billions of pages -- but that requires many incremental improvements, including extension to run on networks of cooperating crawling machines -- not planned until later in the year. (Heritrix currently crawls from a single machine.)

    We are eager for contributors who would like to extend Heritrix in various ways, especially ways that would make it more valuable to researchers, librarians, and archivists. Optional modules for new fetch protocols, new media format link-extractors, or on-the-fly content-analysis to help direct further crawling would all be very interesting to us.

    IA currently receives almost all of its full-web collection via an agreement with Alexa Internet, who have been crawling the web for the Internet Archive since 1996.

    (P.S.: Yes, 'inheritess' should be 'inheritRess'/'heiress'. Oops.)
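
To make "collecting web resources recursively, within some defined parameters" concrete, here is a minimal sketch of a scoped breadth-first crawl. It is not Heritrix code (Heritrix is a Java project, and nothing here reflects its actual API); the `crawl` function, its `allowed_hosts` and `max_depth` parameters, and the fake link graph are all assumptions made for illustration. Link extraction is passed in as a function so the example runs without touching the network.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seeds, get_links, allowed_hosts, max_depth=2):
    """Breadth-first crawl from `seeds`, staying inside a defined scope.

    get_links(url) returns the outgoing links of a page (hypothetical
    stand-in for fetch plus link extraction). Only links whose host is
    in `allowed_hosts` are followed, and pages at `max_depth` are
    collected but not expanded. Returns URLs in visit order.
    """
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # in scope, but do not recurse further
        for link in get_links(url):
            if link in seen:
                continue
            if urlparse(link).hostname not in allowed_hosts:
                continue  # outside the crawl's defined parameters
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# A tiny fake web standing in for real fetches.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": ["http://a.example/3"],
}
pages = crawl(
    ["http://a.example/"],
    lambda url: FAKE_WEB.get(url, []),
    allowed_hosts={"a.example"},
    max_depth=2,
)
```

Swapping the `get_links` function for a different extractor is roughly the kind of extension point the comment invites contributors to write, although the real module interfaces would differ.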
  • A friend from another messageboard is working on this project, and just posted to let us know that he's been /.ed (which is sort of a cool thing in the geek world).

    And of course they got it all wrong. Heritrix != WayBackMachine.

    Heritrix gathers web pages (harvests)
    The WayBackMachine gives access to harvested material.

    Also Heritrix is a new web crawler meant to replace the one that IA has been using (which is owned by Alexa Internet).

    That's what he had to say about it. The post and the article both

  • spam (Score:2, Insightful)

    by krokodil ( 110356 )
    I am afraid spammers may use this code
    to harvest web pages for email addresses.
    • Re:spam (Score:3, Informative)

      by elemental23 ( 322479 )
      Don't lose any sleep over it, spammers have had tools to harvest the web for e-mail addresses for years.

      Insightful?
  • I first read the headline and I thought it said the Internet Archive would be archiving L/GPL code.

    That would be cool actually, like a 1stop shop for all the opensource cvs servers... get to see the linux kernel from .01 to 2.6.0 and a couple thousand other applications too. Oh well, the real story is neat too.
