Gnutella Technology Powers New Search Engine
Matrium writes: "News.com (owned by CNet) is running an article on how the makers of Gnutella have turned their decentralized model of information swapping away from music and porn, and are now looking at search engines. InfraSearch is still in beta, but it does offer an interesting look at the evolution of the Internet." InfraSearch presently paws through only a few search sites, but as a concept it really intrigues me. For one thing, it introduces the long-overdue concept of "how long to search" right into the query dialogue.
Furi has a similar feature (Score:1)
Re:The ‘system’ has brought this on itself. (Score:1)
It would be a real error to claim that there was ever a time when it was all totally 'free' and anonymous.
What about Privacy? (Score:1)
Even if the software as distributed disallows this, it's quite possible for a site to tweak the code to this purpose (one of the few downsides to open source). This could be used for targeted spamming, building enemies lists, etc. Since there's no way to know which systems your search will hit, there's no way of knowing what their (stated or actual) privacy policies are.
I'm less paranoid than some here, but this still lit my worry button.
One possibility... (Score:1)
Of course, this means that the procedure(s) used to generate the relevance rating must be publicly available. But there's nothing stopping the source site from using a different rating system internally to find the pages, as long as it uses a public algorithm for the rating that's returned with the page...
Would this help at all?
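For concreteness, here's a minimal sketch (Python; the function name and the scoring formula are invented, just to illustrate) of a published rating function that both the site and the searcher can run, so the client can spot-check that a returned page actually earns the score claimed for it:

    import re

    def public_relevance(query, page_text):
        """Published rating: each query term contributes a frequency-based
        score capped at 1.0. Any client can recompute this over the
        returned page to verify the site's claimed score."""
        terms = query.lower().split()
        words = re.findall(r"[a-z0-9]+", page_text.lower())
        if not terms or not words:
            return 0.0
        score = 0.0
        for term in terms:
            hits = words.count(term)
            score += min(hits / len(words) * 100, 1.0)  # cap each term's share
        return score / len(terms)  # 0.0 .. 1.0

    # The searching client re-runs the same function on the fetched page:
    # if the recomputed score is far below the claimed one, the site lied
    # and can simply be ignored next time.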
Re:The ‘system’ has brought this on itself. (Score:1)
Time to search (Score:1)
Inference Find (http://www.infind.com) has LONG had a time-to-search option right after the field where you type the search.
Re:An important step (Score:1)
I'd like to see a distributed version of Consumer Reports.
Re:== Do it yourself DDoS? (Score:1)
1) Most of the results would probably be "nope, don't got that", or "not really". Just don't bother to send the info in unless you are above a certain level of matching.
2) Even if there were a large number of results, why couldn't you sort of decentralize that too? You have your own search client on your computer. No centralized search site. You send out a query to several other computers. They talk among themselves, expanding at a geometric rate. The info gets collected at, say, 20 different nodes throughout the Internet. Those 20 nodes each send you just one reference to an HTML/whatever page summarizing their results. You display that in your browser. I'm not quite sure if that made sense, but you should get the general idea: why distribute only a little bit? (There's a rough sketch of this at the end of the post.)
Finally, about your idea: that would keep entities from returning dynamic content. This might not be a bad thing, as other people have pointed out. I've got an idea about that too. Somewhere in the distributed computing, you'd have a computer actually search the page to see if it matches what you want. You could still have people falsely strew what you wanted across the page, along with porn and printer cartridge deals, I guess. I also like the idea of "moderation". You could even have a search for more "spamlike instances" in the pages. Search for porn, toner cartridge deals, etc. (unless that's what you searched for, of course), and make these more likely to be moderated. Maybe you could also take away the ability to be in the searches if you get moderated as a spammer too much.
Just some thoughts.
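Here's a toy simulation of that fan-out idea (Python; every number, name, and the matching rule are invented for illustration): queries flood outward with a TTL, loops are avoided with a seen-set, and per point 1, nodes below a match threshold simply stay silent instead of replying "nope, don't got that":

    import random

    FANOUT = 4        # peers contacted per hop (toy value; imagine 10)
    MAX_TTL = 3       # hops before a query dies
    MIN_MATCH = 0.5   # below this, a node stays silent

    class Node:
        def __init__(self, name, docs):
            self.name = name
            self.docs = docs                  # {url: text}
            self.peers = []

        def score(self, term, text):
            return 1.0 if term in text else 0.0

        def handle(self, term, ttl, seen, results):
            if self in seen:                  # loop avoidance
                return
            seen.add(self)
            hits = []
            for url, text in self.docs.items():
                s = self.score(term, text)
                if s >= MIN_MATCH:
                    hits.append((url, s))
            if hits:                          # only answer when we have something
                results.extend((self.name, url) for url, _ in hits[:5])
            if ttl > 1:                       # forward to a random subset of peers
                for peer in random.sample(self.peers, min(FANOUT, len(self.peers))):
                    peer.handle(term, ttl - 1, seen, results)

    # wire up a toy overlay and search it from one node
    nodes = [Node(f"n{i}", {f"http://n{i}/doc": "gnutella search" if i % 3 == 0
                            else "unrelated page"}) for i in range(50)]
    for n in nodes:
        n.peers = random.sample([m for m in nodes if m is not n], 8)

    results = []
    nodes[0].handle("gnutella", MAX_TTL, set(), results)
    print(f"{len(results)} hits:", results[:5])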
Another Search engine. (Score:1)
It lets you filter the search results: a type of automatic directory.
Re:Illicit network (Score:1)
Unlike Napster, however, AltaVista, HotBot, Google, etc. can be used to search for pirated software, pornography and blockbuster movies...
Short memories (Score:1)
At least *I* am old enough to remember that this technology has already been implemented, and it is now dead because relevancy doesn't scale to an adversarial Internet.

This sort of system used to be called Archie, remember? Hello? It stopped being relevant because it was supplanted by something new called a <finger quote gesture> "search engine" </finger quote gesture>. The problem with Archie was that anyone could claim to have just what you were looking for, so they did. Duh.

Now Andreessen et al. say this is great because it "handles dynamic content", which of course "search engines can't do". The dynamic content they refer to is a spec sheet for a computer that Dell supposes you are looking for. So the point is that Dell can just *create* the spec sheet you're looking for in response to your search, right? Sounds spam-proof to me!

Duh.
long overdue? (Score:1)
Re:Interesting idea with problems (Score:1)
They're still there, and still used correctly by some sites.
There is already a lot of spamming of search engines, and the search engines often aren't that good at weeding out the spammers. Perhaps collaborative trust-based filtering is the way to go. Something like this: anyone can register a vote for a filter, and the individual filterings are also rated by everyone (meta-moderation, in other words). Those with the highest karma and no reputation for censoring things, and eventually those most knowledgeable in the area and/or those of similar political disposition to the user in question, will tend to be trusted by other users. (A rough sketch follows.)
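Very roughly, in Python (the karma numbers and update rules are invented for illustration): each user's filter vote counts in proportion to karma earned through meta-moderation, so a spammer's self-votes are drowned out by trusted users:

    def filter_score(votes, karma):
        """votes: {user: +1 (legit) or -1 (spam)}; karma: {user: float >= 0}.
        Returns a karma-weighted verdict in [-1, 1] for one search result."""
        total = sum(karma.get(u, 0.0) for u in votes)
        if total == 0:
            return 0.0  # nobody trustworthy has voted yet
        return sum(v * karma.get(u, 0.0) for u, v in votes.items()) / total

    def meta_moderate(karma, user, agreed):
        """Meta-moderation: nudge a user's karma up when others agree with
        their filterings, down when they don't."""
        karma[user] = max(0.0, karma.get(user, 1.0) + (0.1 if agreed else -0.2))

    karma = {"alice": 5.0, "spammer": 0.5}
    votes = {"alice": -1, "spammer": +1}   # alice flags the result as spam
    print(filter_score(votes, karma))      # negative: the result gets filtered
    meta_moderate(karma, "alice", agreed=True)  # her good call raises her karma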
(2) This search technology essentially turns a search into an advertising stream. Since the site decides what to return, it'll return a blurb instead of the context around the match. And if the site can return graphics and not just text strings... oh, my! Advertising banners as search results! Joy.
In some cases, this could actually drive users away. But yes, it is a problem. Filtering would be a good idea here too. The server software could come with a warning attached - "You MUST provide an ALT text alternative to any images, otherwise you will drive away viewers who choose to block images in search results."
(3) The results are going to be dependent on the location of the query. The same question asked from a machine in California is likely to return different results than when asked from a machine in Germany (especially with low timeouts). This isn't horrible, but not all that good. In particular, it means that I cannot tell other people "Search for 'foo', you'll find the site I am talking about on the first page".
Well, this already happens. You get different results depending on which search engine or directory you choose.
Re:== Do it yourself DDoS? (Score:1)
No, certain searches should be distributed as well. How many cluebies search for "sex" or "MP3s" on Yahoo every day? Many. What are the results going to be? More or less exactly the same (low-quality) results each time.
Yet these results aren't cached at all, because the results of CGI scripts or servlets or other dynamic content providers aren't cached, according to the HTTP specification. This is a big waste. We need smarter protocols, and this idea might be a step in the right direction.
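As a sketch of what a smarter protocol could permit (Python; entirely hypothetical, not part of any real spec): cache dynamic search results keyed on the normalized query, with a short TTL, so the millionth "MP3s" search never touches the engine:

    import time

    class QueryCache:
        """Cache dynamic search results by normalized query, with a TTL.
        HTTP/1.0-era caches refused to store CGI output; a search-aware
        protocol could safely do this for hot, identical queries."""
        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self.store = {}   # query -> (expires_at, results)

        @staticmethod
        def normalize(query):
            return " ".join(sorted(query.lower().split()))

        def get(self, query, fetch):
            key = self.normalize(query)
            entry = self.store.get(key)
            if entry and entry[0] > time.time():
                return entry[1]                      # cache hit
            results = fetch(query)                   # miss: ask the real engine
            self.store[key] = (time.time() + self.ttl, results)
            return results

    cache = QueryCache()
    hits = cache.get("MP3s sex", lambda q: ["result1", "result2"])  # cached 5 min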
Re:== Do it yourself DDoS? (Score:1)
Don't search engines, search caches! (Score:1)
Add a mechanism or two to let users rate pages, or better, to rate them automatically according to how long they are actually displayed in browsers, whether their links are used, and whether content is saved, printed, or bookmarked, et cetera...
Run browsers through the engine, and off of URLs/pages read by the distributed search engine, so it can do the above rating and keep a more appropriate cache, maybe even whole smallish pages to return instead of URLs.
Combine the data from several sources to keep someone from skewing the ratings, and at the end-user's node rate the results from other nodes (as above, preferably automatically too) according to usefulness/cluelessness/spamishness. Share that info too! (A sketch follows.)
Then you have a distributed search engine worth something.
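A crude sketch of that implicit rating (Python; the signal weights are pulled out of thin air): fold dwell time, link follows, saves, prints and bookmarks into one score per URL, then take the median across independent sources so no single node can skew the rating:

    # weights for implicit signals (invented numbers, for illustration only)
    WEIGHTS = {"seconds_displayed": 0.01, "links_followed": 0.5,
               "saved": 1.0, "printed": 1.0, "bookmarked": 2.0}

    def implicit_rating(events):
        """events: {signal_name: count/amount} observed by one browser."""
        return sum(WEIGHTS.get(k, 0.0) * v for k, v in events.items())

    def merged_rating(per_source_ratings):
        """Median across independent sources, so one skewed node can't
        drag the rating far in either direction."""
        xs = sorted(per_source_ratings)
        n = len(xs)
        return xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2

    print(merged_rating([implicit_rating({"seconds_displayed": 120, "bookmarked": 1}),
                         implicit_rating({"seconds_displayed": 5}),
                         implicit_rating({"seconds_displayed": 9999})]))  # outlier damped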
Simple pirate tools? (Score:1)
> technology, they will be viewed as simply tools
> for pirating or other illegal use.
Puh-leeze. The reason these technologies are viewed as aids to pirates is that they *are* aids to pirates. Not that Lars is the expert, but I tend to believe him when he says that of all the Napster traffic they monitored, only a negligible portion was legitimate. Likewise, Gnutella turns around and says: we have a tool that does just what Napster does, but can avoid messy lawsuits through decentralization!
Is it any wonder people are skeptical?
Push vs. Poll :-) (Score:1)
Instead of a Gnutella-based search engine, it's a Napster-based search engine (mind you, the offerings stay live even when the client isn't connected).
Re:Interesting idea with problems (Score:1)
This would stop some of the potential abuse of the search engine (your points 1 & 2).
As for your third point, you could always use email/ICQ/AIM/IRC/etc. to send the person the URL.
That's pointless... (Score:1)
Mike Roberto (roberto@soul.apk.net [mailto]) -GAIM: MicroBerto
This was done years ago.. (Score:1)
Re:Illicit network (Score:1)
-- kwashiorkor --
Pure speculation gets you nowhere.
Get yer buzzwords here! (Score:1)
Risks explored (Score:1)
Some relevant material from the article The Value of Gnutella and Freenet [webreview.com]:
Old Concept, New application (Score:1)
Re:Starting my IPO (Score:1)
Infra Search source code? (Score:1)
Re: cache between server and browser (Score:1)
regex for searching (Score:1)
Here's an Idea (Score:1)
Re:gnutella sucks right now (Score:1)
Re:Faking results? (Score:1)
Ya...exactly...and also look at
Re:Starting my IPO (Score:1)
"<Insert wildly overstated profit estimates here>"
How's that?
Re:DVD (delurk) (Score:1)
Ouch! It hurts!
Please, is this a joke ?
Say it isn't. Slashdot is a site for news and discussion of that news. There is no need for off-topic things. Slashdot readers are not crackers (look at http://www.2600.com [2600.com] instead). You can break code with DeCSS if you find it. No one's going to help you with that.
The people who told you /. was a 31337 h4x0r w4r3z bunch were misinformed, or lied to you.
Now, say it was a joke. I reply to this because there are not many comments at this time (3, including two off-topic), so there isn't any more interesting debate here for now.
This is Great (Score:1)
Tres cool (Score:1)
Uses (Score:1)
Allow Me to Introduce Myself... (Score:1)
Hi. I'm Allan Cox, Open Source advocate, Linux [saltire.org] advocate, and primary coder for Linux's TCP/IP stack. I hope I'm welcome in the SlashDot forums, as til this point, I've been a totally arrogant, antisocial bastard to the community which barely pays for my lifestyle.
In regards to the TCP/IP stack in Linux and my arrogant attitude, I must apologize: as you all already knew, and I just recently admitted to myself, FreeBSD [saltire.org]'s TCP/IP stack is far superior to Linux's, and to top it off, Microsoft [saltire.org] has proven many a time that even the TCP/IP code found in Windows NT [saltire.org] functions better than the drivel I have generated myself. Boy, what a humbler that is! It was like RMS and ESR yelling at me on my own front porch (well it's not really my front porch, it's the landlady's, in front of my one-bed, half-bathroom hovel, but you get the point)!
I'd also like to say, in regards to those who read and post in SlashDot's forums... I am sure I will be seeing Allan Cox. [note the period], Alien Cox, Allan Cocks, Allan Coox, and the like. Please, please, please, for those of you who take SlashDot posting seriously (as I do now, amen!) do not let these crank posters (heretofore to be called "trolls") ruin CmdrTaco's bountiful SlashDot experience! "Trolls" take some delight in confusing the populace and causing disparity in the community. Take the time to learn the real from the fake, as I have (re: how I admitted to myself my TCP/IP stack for Linux actually sucks)!
Thank you.
Only collaborative filtering will prevent this. (Score:2)
This looks BAD (Score:2)
Of course, if this were heavily user-moderated, I guess it might just work. But don't hold your breath. I'll be sticking to Google...
old idea: federated search engines (Score:2)
Re:Interesting idea with problems (Score:2)
Sure, but does anybody care? Meta tags are now 'tainted' and no search engine spares so much as a glance in their direction.
Perhaps collaborative trust-based filtering is the way to go.
I don't understand how this could work. It's doable for a tree-like reference site (Yahoo), but seems impossible for a pure search engine (Google). Let's say I type "support vector machines Vapnik" into the engine -- who is going to filter the results, and how, so that I don't get "naked and petrified young girls" matches? The only feasible thing seems to be locking out IP addresses which supply bogus data, but this will get extremely messy extremely fast.
You MUST provide an ALT text alternative to any images, otherwise you will drive away viewers
So instead of a banner I will see "Come to our site for hot chicks, naked and petrified...". I don't think this is going to help much.
I wasn't really concerned with bandwidth. I was concerned with the fact that a search becomes a request to send targeted advertising to you, and nothing more.
Kaa
Re:An important step (Score:2)
www.deja.com [deja.com]
But organization helps, so for now, stick with Consumer Reports.
What is the difference to Harvest? (Score:2)
Re:Open Source but not (Score:2)
It is not sad. You're simply parroting Linus Torvalds's "release early, release often" advice. This works for an OS because compatibility problems are the issue of the day. But there are also reasons not to do so:
Most people don't take the time to upgrade, so they'll miss out on major features that are added later.
If you release something that is incomplete, people will try it, see that it is incomplete, and have a poor impression from that point on.
If promising software is released at an early stage, there is likely to be much more "cloning activity," channeling effort into doing things again--the right way--instead of tweaking something that's already most of the way there.
Re:What is the difference to Harvest? (Score:2)
The Broker server may exist because of the spam problems many have been discussing above.
dns rehash (Score:2)
All these technologies are just DNS all over again. DNS was created to make host information available all over the internet. Here's the difference:
When DNS was set up, it was probably just as easy to try and pirate stuff, and who knows? Maybe people did use the early internet for illicit purposes, but only the few people who were in the know could actually do so. And not much was available on the net anyway. But DNS was created for the exact same purpose as Napster and Freenet: to make it easy to share information.
Nowadays, the internet is so big that there are lots of people into it only to make money. The possibility of a scam makes people run to see how they can get their share of it, and a technology like this, however innocent, will make the headlines when everyone rushes over to see what scam (and related lawsuit) they can pull off.
All these technologies (Freenet, Napster-likes, all sorts of things) are incredibly valuable extensions of what already provides structure for the wired world. If someone had thought of them in 1980, we would have a much tighter, more distributed internet today.
Well, we've thought of them now. I hope they are allowed to flourish, and that people don't keep just thinking about the negative implications of them. This is the first time I've seen a concrete example of putting it to good use.
I think we should have the right and the possibility to choose to share what we want to:
Imagine all the information that our governments gather on us, a la Enemy of the State, for example: with this kind of peer-to-peer network idea, we could all be gathering and sharing that information already, and maybe even doing something positive with it!
Developer's mailing list (Score:2)
Re:What is the difference to Harvest? (Score:2)
o Nullsoft creates a search engine based on the technology to legitimize it.
o InfraSearch gets media attention, much fanfare, etc.
o Harvest creates a distributed search engine.
o Had you heard of it before now?
It's the same old story: to be heard you have to be controversial, or rich.
Gotta love the 'net - created for war, popularized by porn and piracy.
Starting my IPO (Score:2)
We're called Search.
Gnutellanet (Score:2)
Think of the children!!! (Score:2)
Please, Slashdotters, think of the children! And stop stealing the food of the innocent children with this Gnutella technology!
Re:The ‘system’ has brought this on itself. (Score:2)
At the time, as I remember, one major argument against censoring the 'net was that it was nearly impossible to do - "anyone" can post "anything" on the net, and because its so international, no one nation had control.
Oh, where have those days of freedom gone? How did the censors get past those barriers? Easily enough, it seems: they have the money, and the will to spend it on their self-interest, which makes anything possible. In retrospect, our claims of immunity from censorship were naive.
I believe that once systems like Gnutella become popular, they will move (like the original web) from being a geek's haven to a corporate tool, and be appropriately restricted by corporate needs. Maybe less so than the web today, but order will be enforced. How? I don't know, but did we foresee the DMCA and the other tactics corps are using to censor the 'net?
Don't worry, by then I'm sure we'll think of something else.
Re:Faking results? (Score:2)
A sort of learning search engine that receives feedback from users ("this site isn't about what it claims to be, don't show it as a possible answer", "this site is excellent and very complete", "this site is nice but unreadable without browser so-and-so", etc.). That could be the future. Of course, it would need a lot of negative feedback, unbalanced by positives, plus some other checks, to definitively dismiss a site; otherwise we can imagine a new kind of "softwar": picture some corporation running a script that sends bad feedback to bury a competitor's website...
Sidenote: IIRC, "Softwar" is the title of a novel, so this wordplay is not mine.
Moderation/site ranking system (Score:2)
I'm definitely not an expert, but collaborative moderation doesn't seem like a bad idea. You could maybe have a separate, probably more centralised moderation server. (Or lots of them if people start running their own.) Users could rank the results they get from any given site, and when others run a search, the reliable replying sites come up first.
There are still lots of problems, though, like how to stop anyone from just moderating their own site to the top, and how to make sure the responding site is exactly who they say they are, which is one of the major problems with spam these days anyway. It could also be really tedious working out how to distinguish a good result from a bad one.
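Back-of-the-napkin version of such a moderation server (Python; everything here is hypothetical): users rate replying sites, the server keeps a running reliability score per site, and the client sorts incoming results by it:

    class ModerationServer:
        """Tracks a reliability score per replying site, from user ratings."""
        def __init__(self):
            self.scores = {}   # site -> (total_rating, num_ratings)

        def rate(self, site, rating):           # rating in -1..+1
            total, n = self.scores.get(site, (0.0, 0))
            self.scores[site] = (total + rating, n + 1)

        def reliability(self, site):
            total, n = self.scores.get(site, (0.0, 1))
            return total / max(n, 1)            # unknown sites default to 0

    def rank_results(results, mod):
        """results: list of (site, url); reliable responders come first."""
        return sorted(results, key=lambda r: mod.reliability(r[0]), reverse=True)

    mod = ModerationServer()
    mod.rate("spamhost.example", -1)
    mod.rate("goodhost.example", +1)
    print(rank_results([("spamhost.example", "/x"), ("goodhost.example", "/y")], mod))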
Faking results? (Score:3)
Who is enforcing that sites won't just lie? Maybe some sort of collaborative moderation a la Slashdot would be needed?
Open Source but not (Score:3)
More than web data? (Score:3)
This makes pira^H^H^H^H trading files even easier - people no longer need to install a client; there's a nice web-search interface with direct download URLs. Web searching for files with no broken links. Nice.
gnutella sucks right now (Score:3)
Will this work? (Score:3)
Just think: you're dialled in to an ISP and want to search for something. Eventually you start getting responses, first from hosts logically closer to you, then from those further away (we can only hope there's no negative response in the protocol). You may have to wait for it all to come down the line before you get a useful result. And you'll still have to wade through mountains of useless junk (since responders get to define what content they have), except that now you'll have to actually visit the site to see that it's just another boring article on internet protocols instead of the "fix your credit record" guys you were looking for. Eventually, you'll learn which hosts not to accept responses from and which ones respond better to which types of queries (just like today).
Big search engines will still dominate the field by being able to get it right most of the time. I don't see any real advance.
An important step (Score:3)
Re:Simple pirate tools? (Score:4)
The ‘system’ has brought this on itself. (Score:4)
Distributed searching. (Score:4)
What's to stop people from spamming the index?
I suppose they could build in a little technology to actually check the page. On the other hand, anything you do can be circumvented.
I suppose this is the classic downside to the entire Internet "thing". You can't enforce absolute control in a medium specifically designed against it. Of course, there are a few things you could do to help the situation.
With a Gnutella-style model for distributed searches, any host that consistently returns false positives could be cut off by the adjacent node(s), right? If you have tons of traffic coming through your node from a spam site, couldn't you just stop forwarding requests to it?
Of course, this wouldn't stop all spamming on the index, but it should allow any one node to cut off a spam node "below" itself. On the other hand, since not everyone will be eternally vigilant, this much freedom could be damaging.
You could always have something like the MAPS RBL for search nodes: just have someone paying attention keep a database of hosts to ignore requests from. Since anybody can create a blackhole list, it wouldn't necessarily be centralized, so it wouldn't impinge on the freedom of the search. It may still have an "open relay" problem, like SMTP does now, but that doesn't necessarily make it not worthwhile.
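Roughly, per node (Python sketch; the thresholds are invented): track a false-positive rate for each adjacent node, stop forwarding to chronic offenders, and optionally consult a shared, RBL-style ignore list:

    FALSE_POSITIVE_LIMIT = 0.8   # invented threshold
    MIN_SAMPLES = 20             # don't judge a peer on a handful of replies

    class PeerStats:
        def __init__(self):
            self.replies = 0
            self.false_positives = 0   # replies users flagged as spam/no-match

        def is_spammy(self):
            return (self.replies >= MIN_SAMPLES and
                    self.false_positives / self.replies > FALSE_POSITIVE_LIMIT)

    def forwardable_peers(peers, stats, blackhole_list):
        """Peers we still relay queries to: not chronic false-positive sources,
        and not on the (optional, RBL-style) shared ignore list."""
        return [p for p in peers
                if p not in blackhole_list
                and not stats.setdefault(p, PeerStats()).is_spammy()]

    stats = {"spamnode": PeerStats()}
    stats["spamnode"].replies, stats["spamnode"].false_positives = 50, 49
    print(forwardable_peers(["spamnode", "goodnode"], stats, blackhole_list=set()))
    # -> ['goodnode']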
== Do it yourself DDoS? (Score:4)
Example: if you get the results of this kind of broadcast search back from a bad search ("sex nude pictures jpg"), you'll trash your own internet connection and probably those of others (or the search interface's, if you use a web interface).
Imagine a network of a million hosts (a small subset of all webservers), each running a gnutella-based search engine. On one of the servers is an interface to search the network for some information. The query is forwarded onto the overlay network, to, say, 10 nodes at each hop, assuming some mechanism is in place to avoid loops. If the network is well interconnected, it will take about 5-6 hops to reach an edge of the cloud (probably a couple of times more to reach all the nodes). As soon as the first nodes get the search request, they send back results, say limited to the 5-10 most significant hits. Each reply is a number of tuples (URL, a description, an indication of how close the match is, a timestamp, and probably more), maybe 1-2 kB per reply. Say 10% of servers have a match; then 100,000 hosts will at some point send back results.
I calculate that roughly 100 MB of results will arrive at the searching node within a few minutes, if it can even process that dataflow.
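Checking that arithmetic with the post's own assumptions (Python):

    hosts          = 1_000_000   # nodes in the overlay
    match_fraction = 0.10        # "say 10% of servers have a match"
    reply_kb       = 1.5         # "maybe 1-2 kB per reply"; take the middle
    fanout, hops   = 10, 6       # forwarded to 10 nodes per hop, ~5-6 hops

    print(fanout ** hops)                  # 1,000,000: why ~6 hops reach the edge
    replies  = hosts * match_fraction      # 100,000 replying hosts
    total_mb = replies * reply_kb / 1024
    print(f"{replies:,.0f} replies, ~{total_mb:.0f} MB at the searching node")
    # -> 100,000 replies, ~146 MB: in line with the ~100 MB claimed above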
And this is only one search; both the searching nodes and the servers will have to deal with a lot of searches, if other search engines are any comparison.
Centralised search engines are a good way to limit the bandwidth usage, but they are slow to pick up changes on the web.
Idea: it would be good to have a webserver keep track of an index for its own document space and, when that changes, push the change to a central search engine where it can be searched. Distributing the searches is a waste of resources; IMHO you should distribute the indexing mechanism and centralise the searching.
And considering that for this thing to work you need an index-engine on each server anyway, it's a small step to do it like this, isn't it?
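A sketch of that inversion (Python; all names hypothetical): each webserver indexes only its own document space and pushes deltas to a central engine, so searching stays cheap and centralized while indexing is distributed and always fresh:

    class CentralEngine:
        """Receives index deltas pushed by webservers; answers searches locally."""
        def __init__(self):
            self.index = {}   # term -> set of URLs

        def push_delta(self, url, added_terms, removed_terms):
            for t in removed_terms:
                self.index.get(t, set()).discard(url)
            for t in added_terms:
                self.index.setdefault(t, set()).add(url)

        def search(self, term):
            return sorted(self.index.get(term, set()))

    class WebServer:
        """Indexes its own document space and pushes changes, not pages."""
        def __init__(self, engine):
            self.engine = engine
            self.terms = {}   # url -> set of terms currently indexed

        def document_changed(self, url, text):
            new = set(text.lower().split())
            old = self.terms.get(url, set())
            self.engine.push_delta(url, new - old, old - new)
            self.terms[url] = new

    engine = CentralEngine()
    site = WebServer(engine)
    site.document_changed("http://example.com/a", "gnutella distributed search")
    print(engine.search("gnutella"))   # fresh the moment the page changes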
Interesting idea with problems (Score:5)
(1) An obvious point: if a site itself decides which queries to respond to, there'll be a lot of spamming of the index. Doesn't anybody remember the fate of the [meta] tags?
(2) This search technology essentially turns a search into an advertising stream. Since the site decides what to return, it'll return a blurb instead of the context around the match. And if the site can return graphics and not just text strings... oh, my! Advertising banners as search results! Joy.
(3) The results are going to be dependent on the location of the query. The same question asked from a machine in California is likely to return different results than when asked from a machine in Germany (especially with low timeouts). This isn't horrible, but not all that good. In particular, it means that I cannot tell other people "Search for 'foo', you'll find the site I am talking about on the first page".
Out of the three, the first is so obvious that something will be done about it; I don't know what, though. It's the second that worries me most of all. Besides more advertising, there is a basic problem here -- I want to see what the site has, not necessarily what it prefers to show me. To give a trivial example, a company could have a recalls/warnings/manufacturing-defects page somewhere on its site to satisfy disclosure requirements, but never return this page for any search.
All in all, I'll stick with Google for the time being, thank you very much.
Kaa
Illicit network (Score:5)
"The Department of Transportation released a shocking report this morning, in which it was discovered that the federal highway system, unlike rural routes, allow transportation of any kind of material. A random sampling of items being transported at any given time ranges from pirated music to pirated blockbuster movies, to pornography."