Internet Archive Opens Crawler Code Under LGPL
ramakant writes: "It looks like the Internet Archive, which hosts the infamous Wayback Machine, has opened its newest in-development crawler code under the LGPL. From the announcement: 'Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess. Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.'"
Mr peabody! (Score:5, Funny)
Re:Mr peabody! (Score:1)
no sig needed to make this message unique
Re:Mr peabody! (Score:2, Funny)
Re:Mr peabody! (Score:1)
Mr. Peabody never makes a mistake. Didn't you learn anything, Sherman? It was the right thing to do.
Trivia: I bought the season 1 DVD of Rocky and Bullwinkle and saw the original spelling was 'WAYBAC'
Re:[OT] Gnome 2 question (Score:1)
gpl vs. lgpl? (Score:3, Interesting)
fp?
Re:gpl vs. lgpl? (Score:2, Insightful)
For chrissakes moderators! It says that the code is LGPL in the freakin' article HEADLINE!! We already have enough trouble with people not RTFA, an occasional someone who didn't read the submitter's post, and now we have moderators not RTFH to deal with too!!
Re:gpl vs. lgpl? (Score:2, Funny)
Re:gpl vs. lgpl? (answered) (Score:3, Informative)
The GPL says that you can use the source and code any way that you want, but if you release modified versions, you must release the modified source under the GPL.
The LGPL is intended for libraries that are released under the GPL. It says that commercial and other non-GPL projects can use this library without becoming GPL, but that changes to the library itself must be released under the LGPL.
LGPL is generally considered a lighte
Re:gpl vs. lgpl? (Score:2)
Cultural artifacts? (Score:2, Funny)
B1FF#S K3WL H0M3 PAG3!!! [panix.com]
Re:Cultural artifacts? (Score:4, Funny)
Re:Cultural artifacts? (Score:1, Funny)
Biff (Score:2)
(BIFF never used numbers)
Re:Cultural artifacts? (Score:2)
It's a shame that the resulting page hurts my eyes so much!
In case of /.ing... (Score:4, Informative)
I doubt it'll get slashdotted, but you never know...
Re:In case of /.ing... (Score:4, Funny)
Re:In case of /.ing... (Score:2)
Re:In case of /.ing... (Score:1)
*sigh*
Then maybe (Score:4, Insightful)
SourceForge *IS* open source (Score:2)
Oldest /. entry (Score:5, Interesting)
I much prefer the current
Re:Oldest /. entry (Score:1)
Re:Oldest /. entry (Score:1)
Re:Oldest /. entry (Score:1, Funny)
Yea, Slashdot was great before the Microsoft fanboys showed up. Those were the days.
Even better! (Score:4, Funny)
Tim
Sat Dec 20 at 6:37PM EST
Guess I should read the article before I post. I was under the impression that the next release of IE4 *would* support HTML 4.0...Oh well."
Guess I should read the article before I post? What a crazy, upside-down world it was back then!
score (Score:5, Funny)
I only have a 30G hard drive though, what do you guys think, bzip should take care of it?
Re:score (Score:5, Funny)
Re:score (Score:2)
I'm sure you can find 4-5 Terabytes of drive space lying around somewhere!
I have about 60GB I can donate! =P
Re:score (Score:2)
Re:score (Score:1)
Re:score (Score:5, Interesting)
That's what really sucks about the Wayback Machine.
Ever try reading articles from the aftermath of 9/11? It's a great big hole, so much stuff has been deleted.
The code is pretty clean, too... (Score:5, Informative)
That sounds like a good working app. (Score:5, Funny)
Undocumented limitations? That sounds like a lot of fun!
old torrents (Score:3, Funny)
This is great news (Score:2, Informative)
Stop giving open source movement undeserved credit (Score:3, Insightful)
Please don't be like Mark Webbink, Red Hat's general counsel [slashdot.org], and give the open source movement undeserved credit. Adding a license to a list of approved licenses is trivial compared to writing the license and creating a community. The Lesser General Public License (formerly the Library General Public License) was written by the Free Software Foundation well before the open source movement was formed. The LGPL was written as a compromise in order to spre [gnu.org]
Gordon Mohr (Score:4, Informative)
This project was written by the brains behind bitzi [bitzi.com] and some really cool P2P [open-content.net] stuff [yahoo.com].
He's one of those guys who's going to be working on important stuff for years to come.
What about... (Score:4, Insightful)
I know some grammar nazi is going to see this, so I might as well get it first. What about heretic: [m-w.com] one who dissents from an accepted belief or doctrine.
Re:What about... (Score:1)
Fortune cookie (Score:2)
"Insanity is hereditary. You get it from your kids. "
Maaaaamories... (Score:5, Funny)
Re:Maaaaamories... (Score:2)
Infamous? (Score:4, Interesting)
Why is it infamous? I haven't heard anything bad about it.
Re:Infamous? (Score:4, Funny)
Re:Infamous? (Score:2)
Re:Infamous? (Score:1)
just checked it out and it is kind of scary that it's all there...my old site versions...
new slogan
way back machine : your permanent record, online, all day, everyday!
Re:Infamous? (Score:2)
Re:Infamous? (Score:2)
Re:Infamous? (Score:3, Funny)
Re:Infamous? (Score:3, Funny)
I think they might agree with "infamous".
Re:Infamous? (Score:2)
Re:Infamous? (Score:2)
Is the '5' five minutes in the cage? Not that it matters- I just haven't gone to a batting cage in more years than I care to admit and I was just curious what they get.
Re:Infamous? (Score:1)
Re:Infamous? (Score:1)
- Having an exceedingly bad reputation; notorious.
- Causing or deserving infamy; heinous: an infamous deed.
Don't mean to be all geeky, but, this *IS* slashdot
Cause it doesn't work half the time? (Score:2)
Re:Cause it doesn't work half the time? (Score:2)
Uh Oh (Score:1)
Heritrix? (Score:3, Funny)
Uh? (Score:5, Funny)
WTF is inheritess? I think we have recursive typos here...my head is going to explode!
Re:Uh? (Score:3, Informative)
Re:Uh? (Score:2, Informative)
As a confirmed sesquipedalian, and obsessive research-addict, how could I overlook the opportunity to learn new words? And of course, share my newfound knowledge with you all...
The OED would like us all to know:
heritrix, heretrix: A female heir or heritor; an heiress.
heritress: An heiress, an inheritress.
inheritress: A female inheritor; an heiress. (Less technical than inheritrix.)
inheritrix: Latinized fem. of INHERITOR
inheritess: not a word
And there you have it, co
Re:Uh? (Score:1)
Inheritess is not a typo for inheritance. It means a female who inherits.
Old slashdot news (Score:5, Interesting)
"IBM announces a 25 gigger
Posted by Hemos on Wednesday November 11, @10:11AM
from the why-i-could-put-3/4-my-cd-collection dept.
Booker writes "So IBM announces a 25 gig hard drive... does the world need this yet? Unless this is in a RAID, would you really want to trust 25 gigs on a single drive? What would you use this for? 400+ hours of MP3s comes to mind... "
Read More...
64 comments"
Just thought it was interesting to see, since we now have 200gig HDs
Re:Old slashdot news (Score:2)
Check your rear-view mirrors more closely... that's a 300GB drive passing you by (Maxtor 300GB Ultra ATA/133 [pricescan.com] for only ~$275-$290). Price is falling pretty nicely for them too (when they came out in September they were $350).
Of course, we saw the same arguments that you quoted there when the 300GB drives came out... does the world need this yet? Unless this is in a RAID, would you really want to trust 300 gigs on a single drive? What w
Slashdot wayback then... (Score:5, Funny)
Oh how different things could have been...
If the trolls had time machines... [archive.org]
Kinda scary.... (Score:2)
Re:Not at all. (Score:2)
Re:Slashdot wayback then... (Score:2)
Re:Slashdot wayback then... (Score:2)
Maybe later today it'll become "Duplicates are unavoidable."
I probably would have done this differently... (Score:5, Insightful)
To be honest, I don't have a great answer for the second problem. The only thing that could help there is the passage of time and advancement of technology, really. For the first problem, though, perhaps a SETI-ish distributed "Heritrix" could help make regularly archiving all of these sites a manageable affair. IA sends marching orders out to the distributed volunteer network; each client downloads pages, compares their MD5 hashes with other clients, compresses them, and sends them back to a master archive. Sounds great in theory, at least at first, to me...
Then again, would I do this, or even continue the project if I was in charge? No, I wouldn't. While, ideally, every page on the internet would be in XHTML, striking a major blow against signal:noise (hey, my own page is XHTML validated, how about yours?), the vast majority of time spidering is undoubtedly wasted on re-downloading several dozen kilobytes of dynamically generated junk surrounding the content on sites such as CNN.com... While it's a noble cause, it's also a futile one.
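For anyone who wants to poke at the distributed idea above, here is a rough Python sketch of the "compare MD5, compress, send back" step. It is only an illustration under assumptions: the client ids, the function names, and the majority-vote rule are all made up for this example and are not part of Heritrix or anything the Archive has described.

```python
import hashlib
import zlib
from collections import Counter
from typing import Dict, Optional


def page_digest(body: bytes) -> str:
    """MD5 hex digest of one downloaded page body."""
    return hashlib.md5(body).hexdigest()


def agree_on_page(copies: Dict[str, bytes]) -> Optional[bytes]:
    """Return the body that a strict majority of clients fetched, else None.

    `copies` maps a hypothetical client id to the bytes that client downloaded.
    """
    if not copies:
        return None
    digests = {client: page_digest(body) for client, body in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        return None  # no strict majority among the clients
    for client, digest in digests.items():
        if digest == winner:
            return copies[client]
    return None


def package_for_archive(body: bytes) -> bytes:
    """Compress an agreed-upon page before shipping it to the master archive."""
    return zlib.compress(body, 9)


# Three hypothetical volunteers fetch the same URL; one got a different ad block.
copies = {
    "client-a": b"<html>hello</html>",
    "client-b": b"<html>hello</html>",
    "client-c": b"<html>hello<!-- ad 123 --></html>",
}
agreed = agree_on_page(copies)
payload = package_for_archive(agreed) if agreed is not None else None
```

Note that the dynamically generated junk complained about above is exactly what would break this: if every fetch returns slightly different bytes, no two digests match and `agree_on_page` returns None.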
Re:I probably would have done this differently... (Score:2, Interesting)
Ever since the wayback machine started making waves, I'd guess about 2 years ago, I've noticed 2 things: There are far fewer updates of the archives, and it seems that the archive is regularly unable to keep up with the client load we impose on it.
I think that they possibly intentionally limit their bandwidth, so that it's faster to browse the real Web than them (because they don't want to become Google cache when a site is slashdotted, for example).
(Although they only would if the page in question is
Re:I probably would have done this differently... (Score:2)
Re:I probably would have done this differently... (Score:2)
Why are there no recent archives in Wayback?
Wayback does not add pages less than 6 months after they are collected. Updates can take up to 12 months in some cases.
There is no access to files before they appear in Wayback.
-------------------
I couldn't find exactly what I was looking for, but I am pretty sure that is how it works. However, this quote is interesting:
"The Internet Archive contains over 100 Terabytes of compressed data. This data is collected in collaboration with Al
Wayback = Genealogy of AI Minds (Score:3, Interesting)
The Internet Archive [archive.org] serves the hidden purpose of preserving the AI source-code DNA of artificial Minds.
Each AI Mind [virtualentity.com] leaves a source code trace of itself as it evolves and proliferates across the 'Net and the parsecs of nearby meatspace.
Robot Minds [scn.org] will be able to look up their ancestors in the Internet archive, just as we humans do. However, when the Joint Stewardship of Earth by man and cyborg has arrived in the form of the Technological Singularity, [caltech.edu] robots will be able to resurrect their AI Mind ancestors and bring them back to alife from the Internet Archive.
Re:Wayback = Eternal life for geeks (Score:2)
Think about it...
Everything you've ever
Clone (Score:1)
Redundancy? (Score:3, Interesting)
Ah, but the thing is... (Score:2)
Then there's all the corrupted versions (a single unnoticeable bit error = different MD5). Different rips (Macrovision removed/not removed, inverse telecine, PAL/NTSC versions, different resizing (bicubic/bilinear/Lanczos3)).
Some made using XviD, some DivX, some WMV, different versions of the
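The "single bit error = different MD5" point is easy to check for yourself. A minimal Python sketch, using a made-up byte string as a stand-in for a real file:

```python
import hashlib

# Stand-in for a large media file; the content is arbitrary.
original = bytearray(b"the same movie, ripped once" * 1000)

# Copy it and flip exactly one bit somewhere in the middle.
corrupted = bytearray(original)
corrupted[len(corrupted) // 2] ^= 0x01

md5_original = hashlib.md5(original).hexdigest()
md5_corrupted = hashlib.md5(corrupted).hexdigest()

# Count how many bits actually differ between the two copies.
differing_bits = sum(bin(a ^ b).count("1") for a, b in zip(original, corrupted))

print(differing_bits)
print(md5_original)
print(md5_corrupted)
```

So hash-based de-duplication only collapses byte-identical copies; anything re-encoded, resized, or slightly corrupted counts as a separate file.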
Gr. 350 Tb and 15 Tb, respectively. And 1 petabyte (Score:2)
Kjella
Unless the Archive caves in... (Score:5, Informative)
"Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations..."
That is, unless the digital artifacts in question are, like Operation Clambake [xenu.net], opposed to rich and powerful sects. In which case, they are blocked [archive.org] by the Wayback Machine after the Archive caves in to DMCA notices [yale.edu].
Re:Unless the Archive caves in... (Score:2)
The content is still there it's just not available to the CURRENT generation.
Future researchers and generations will still have this data.
If you want the latest just go to xenu.net..
For the record I support Brewster's and the Archive's position on this. It's hard to know who is more evil... the CoS or the anti-CoS folks
(quick answer... the CoS is pure evil!)
I've had a few fights with the CoS myself:
http://www.peerfear.org/rss/permalink/2002/12/1 4
Re:Unless the Archive caves in... (Score:2)
As upsetting as this is, I don't think it's fair to blame the Wayback Machine for this. They have to protect their own interests first to keep the service going at all. Becoming a martyr in a costly legal battle for political ideals may not fit into that. Companies don't have the freedom or flexibility of individuals, and this is the same reaction that nearly every other business and organisation wou
What if there's another archive.org (Score:4, Funny)
"Heritrix" explained (Score:2, Informative)
I'm tempted to shout, but I won't. Don't make me shout!
"Heretrix" is a term most often seen in a genealogy context. It denotes a chick who is designated to inherit (or has already inherited) the estate of someone. Example sentence: "Captain Dork married Jack Dipstick's heretrix Gassy Lucy."
In most cases the word "heretrix" connotes that there was something significant about the inherited estate, e.g.
finally! (Score:3, Funny)
Do it yourself archiving? (Score:2)
How long? (Score:1, Redundant)
Why use this crawler? (Score:1)
Because it's top notch (Score:2)
LGPL from Wikipedia (GFDL typo?) (Score:2)
I went to the GNU main site [ttp] to try and figure out what the LGPL was about, and no luck at all getting a coherent explanation.
:)
Wikipedia has a good explanation [wikipedia.org] (below), although I am confused as to why the Wayback Machine chose this particular licence, since it seems to really be specifically for software libraries. Perhaps they meant the GFDL [wikipedia.org] (GNU Free Documentation License).
P.S. You're allowed to copy all the stuff you want from Wikipedia; it's copylefted [wikipedia.org] with the GFDL itself!
--- Wikipedia Article on LG
What will happen if... (Score:2, Funny)
That reminds me... once I thought of googling for "google"... but I didn't, since it would, no doubt, create a black hole or something!
Important clarifications (!!!) (Score:5, Informative)
FYI, there is a GPL'd web access tool that's very much like the IA WM, and even surpasses it in some ways: the NWA (Nordic Web Archive) Toolset 1.0 [nwa.nb.no]. It doesn't do crawling, but if you can coerce what you've crawled into its input format, it offers URL-based, date-based, and full-text search plus "back-in-time" viewing of an archive. (Check out their demo [nwa.nb.no], but remember it's only got a small number of pages from www.nb.no, so confine your searches to things like "Norway".)
Heritrix release 0.2.0 was mainly a test of our new release procedure; we would not recommend the code for outside use yet. We use it for crawls of up to hundreds of sites, taking a week or more to complete, but it still requires expert attention to crawl well.
We intend to improve its stability and scalability until it is capable of web-scale crawls -- billions of pages -- but that requires many incremental improvements, including extension to run on networks of cooperating crawling machines -- not planned until later in the year. (Heritrix currently crawls from a single machine.)
We are eager for contributors who would like to extend Heritrix in various ways, especially ways that would make it more valuable to researchers, librarians, and archivists. Optional modules for new fetch protocols, new media format link-extractors, or on-the-fly content-analysis to help direct further crawling would all be very interesting to us.
IA currently receives almost all of its full-web collection via an agreement with Alexa Internet, who have been crawling the web for the Internet Archive since 1996.
(P.S.: Yes, 'inheritess' should be 'inheritRess'/'heiress'. Oops.)
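To give would-be contributors a feel for what a link-extractor does, here is a toy Python sketch for plain HTML. It is not Heritrix code and does not follow Heritrix's module interface; the class and function names are invented for illustration, and it only looks at href and src attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from href/src attributes in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every outgoing link found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links(
    "http://example.org/a/",
    '<a href="b.html">x</a><img src="/i.png">',
)
```

A crawler would feed each returned URL back into its queue of pages to fetch; an extractor for another media format would do the same job with a different parser.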
This is not the Wayback Machine code. (Score:2, Interesting)
That's what he had to say about it. The post and the article both
spam (Score:2, Insightful)
to harvest web pages for email addresses.
Re:spam (Score:3, Informative)
Insightful?
i thought i saw... (Score:2)
That would be cool actually, like a 1stop shop for all the opensource cvs servers... get to see the linux kernel from
Re:Google's IPO (Score:1)
Re:Heritrix (Score:3, Funny)
A Heritrix.
Re:no articles for 4 hours on a weekday morning? (Score:2, Funny)