On Finding Semantic Web Documents

Anonymous Coward writes "A research group at the University of Maryland has published a blog post describing their latest approach to finding and indexing Semantic Web documents. They published it in reaction to the view of Peter Norvig (director of search quality at Google) on the Semantic Web (Semantic Web Ontologies: What Works and What Doesn't): 'A friend of mine [from UMBC] just asked can I send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple other extensions on them; he couldn't find them all. I looked, and it turns out there's only around 200,000 of them. That's about 0.005% of the web. We've got a ways to go.'"


Comments:
  • by Simon Brooke ( 45012 ) * <stillyet@googlemail.com> on Friday January 14, 2005 @06:55PM (#11368557) Homepage Journal

    It's not about the filename extension (if any), silly. It's about the data. Valid RDF data may be stored in files with a wide range of extensions, or even (how radical is this?) generated on the fly.

    What matters is, first, the MIME type (most likely application/xml, or preferably text/xml), and then the data in it. See the sketch just below this comment for what that kind of content check looks like.

    Oh, and, First Post, BTW.
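
    For illustration, a rough sketch of that kind of content check in Python (the URL is a placeholder, not anybody's production crawler; assume the usual caveats about redirects, encodings and timeouts):

        # Rough sketch: decide whether a URL serves RDF from the Content-Type
        # header and the root element, rather than from the file extension.
        import urllib.request
        import xml.etree.ElementTree as ET

        RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

        def looks_like_rdf(url):
            with urllib.request.urlopen(url) as resp:
                content_type = resp.headers.get("Content-Type", "")
                body = resp.read()
            # An explicit RDF media type settles it outright.
            if "application/rdf+xml" in content_type:
                return True
            # For generic XML media types, peek at the root element instead.
            if "xml" in content_type:
                try:
                    root = ET.fromstring(body)
                except ET.ParseError:
                    return False
                return root.tag == "{%s}RDF" % RDF_NS
            return False

        print(looks_like_rdf("http://example.org/foaf.rdf"))  # placeholder URL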

  • What about... (Score:4, Insightful)

    by Apreche ( 239272 ) on Friday January 14, 2005 @06:55PM (#11368567) Homepage Journal
    What about all the pages that are .rss but are actually RSS 1.0, which is RDF-based? And what about all the RDF that sits in the comments of .html files and elsewhere? My Creative Commons license is RDF, but it's inside a .html file (a rough sketch of pulling such embedded blocks out follows below). Sure, we do have a long way to go, but the semantic web is bigger than a few file extensions findable by Google.
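
    As a minimal illustration of digging those embedded blocks out (the URL is a placeholder, and a real indexer would want a proper HTML parser rather than a regex):

        # Sketch: find RDF blocks hidden inside HTML comments, the way
        # Creative Commons licence metadata is commonly embedded.
        import re
        import urllib.request

        RDF_IN_COMMENT = re.compile(r"<!--\s*(<rdf:RDF.*?</rdf:RDF>)\s*-->", re.DOTALL)

        def embedded_rdf(url):
            html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
            return RDF_IN_COMMENT.findall(html)

        # Hypothetical page carrying a CC licence block in an HTML comment.
        for block in embedded_rdf("http://example.org/gallery.html"):
            print(block[:60], "...")
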
  • unexpected? (Score:3, Insightful)

    by AnonymousCactus ( 810364 ) on Friday January 14, 2005 @07:02PM (#11368635)
    Without a large number of widely used tools out there that make use of semantic information there won't be that much content designed for them...and without content designed for them the tools won't exist and certainly won't be widely used. Currently it's more of an academic exercise - if we somehow knew what all this information on the web actually was, what could we do with it? More interesting it seems then are approaches at bypassing the markup by hand and do something equivalent automatically.
  • by dracocat ( 554744 ) on Friday January 14, 2005 @07:33PM (#11368925)
    From the article: For each new site S we encounter, we give Google the narrower query 'filetype:owl site:S', which will often end up getting some additional results not included in the earlier query.
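
    For reference, the narrowing described there amounts to building per-site query strings like the following (an illustrative sketch, not the group's code; the list of extensions is an assumption, and, as noted below, actually firing such queries at Google automatically is against its TOS):

        # Illustrative only: the per-site query narrowing the article describes,
        # i.e. turning each newly encountered site S into "filetype:X site:S".
        # Sending these automatically would violate Google's TOS, as noted below.
        SEMANTIC_FILETYPES = ["rdf", "owl", "n3", "daml"]  # assumed extensions

        def narrowed_queries(site):
            return ["filetype:%s site:%s" % (ft, site) for ft in SEMANTIC_FILETYPES]

        for query in narrowed_queries("example.edu"):  # hypothetical site
            print(query)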

    From the Google TOS: [google.com] You may not send automated queries of any sort to Google's system without express permission in advance from Google.

    I am serious. These researchers just used a lot of resources from Google that they had no permission to use. Researchers especially should try to be good citizens on the net and not do tons of automated querying of websites without permission--especially when it is specifically prohibited.

    Google has spent a lot of time and money to get the information it wanted; when the researchers asked for copies, Google didn't give them any--so instead they just took it without permission.

    I would call that stealing, except I won't, because that would start a whole other thread telling me that information cannot be stolen.

    My point is, if you want to do research, at least play by the rules you are given. It may take longer and require more work, but that seems better than using information that you don't have permission to use.

  • by mike_sucks ( 55259 ) on Saturday January 15, 2005 @12:48AM (#11371291) Homepage
    "The problem is, that doesn't require the semantic web or any sort of semantic technologies."

    You're right of course, but for any such initiative to be successful, it needs to use a standard (or at least widely known and stable) format/grammar/etc. so that third-party systems can understand your data.

    This is where RDF, OWL and the other semantic web technologies come into it. Why invent another system when there is already one there?

    "The problem is that RDF is wounded by having an incredibly ugly syntax."

    No, you're confusing the XML syntax with the model. RDF isn't the XML format; that's just one way to serialise an RDF graph. You also have the N-Triples format and others. Ideally, XML-serialised RDF would never be hand-written, it _is_ a pig to do so. But it is a convenient way to dump an RDF graph in such a way that it can be reliably machine-read (which is the whole point of the semantic web).
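
    To make that concrete, a small sketch with the rdflib library (the resource and literals are made up): the same two triples come out as N-Triples or as RDF/XML depending only on which serialiser you ask for.

        # Same graph, two serialisations: the model is the set of triples;
        # RDF/XML and N-Triples are just different ways of writing it down.
        # Assumes rdflib; very old versions return bytes from serialize().
        from rdflib import Graph, Literal, Namespace, URIRef

        DC = Namespace("http://purl.org/dc/elements/1.1/")
        doc = URIRef("http://example.org/report")  # hypothetical resource

        g = Graph()
        g.add((doc, DC["title"], Literal("Quarterly report")))
        g.add((doc, DC["creator"], Literal("A. Author")))

        print(g.serialize(format="nt"))   # one triple per line
        print(g.serialize(format="xml"))  # the RDF/XML nobody should hand-write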

    Notice also how OPML, RSS 2 and Atom are never generated by hand? Given the format is best generated by a computer and best consumed by a computer, what's the problem with the format?

    It would be nice if there were one canonical way to serialise the graph, thus making processing with tools that aren't RDF-aware easier (e.g., an XSLT processor), but I don't think that is a show-stopper.

    "I think the biggest problem is that you can't trust metadata blindly. And most of the big "semantic web" stuff assumes that you can, or that figuring out trust can be "solved"."

    So why are places like eBay, Amazon and so on trusted? How is buying something directly via the eBay web interface any different from buying it via Google, which picked up the same auction from eBay's RDF feed?

    There's a lot of places where trust comes into it. Do you trust Google? Do you trust eBay? Do you trust the seller? The semantic web doesn't solve this problem, but it can make it much easier for you to locate the thing in the first place.

    "And there is metadata that's trusted. EXIF tags are trusted, simply because there's no benefit in lying."

    Right, so why not make the EXIF data available via RDF anyway? Even though it can't be trusted, at least I would be able to search for images that purport to be of a sunset taken between 17:00 and 18:00, using a Canon IXUS II. That's more than I can do now.
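
    A rough sketch of what that could look like, using Pillow to read a couple of EXIF fields and emit them as N-Triples (the property namespace is invented for illustration, not a real vocabulary):

        # Pull a couple of EXIF fields out of an image and emit them as
        # N-Triples, so "images claiming to be from a Canon IXUS II" becomes
        # a plain graph query. Namespace and filenames are placeholders.
        from PIL import Image
        from PIL.ExifTags import TAGS

        EX = "http://example.org/exif#"  # hypothetical vocabulary

        def exif_triples(path, image_uri):
            exif = Image.open(path).getexif()
            for tag_id, value in exif.items():
                name = TAGS.get(tag_id, str(tag_id))
                if name in ("Model", "DateTime"):
                    yield '<%s> <%s%s> "%s" .' % (image_uri, EX, name, value)

        for triple in exif_triples("sunset.jpg", "http://example.org/sunset.jpg"):
            print(triple)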

    "If you don't have RDF-format data, can't send out jack-booted thugs to force people to make RDF-format data, and need some of it to make things work properly, you need to figure out how to generate it."

    Well, that's the main problem the semantic web faces today: lack of tool support. Why doesn't Dreamweaver allow people to embed Dublin Core RDF into every document it produces? Why don't the endless numbers of slide-show gallery generators do the same for EXIF data?
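
    What a tool could do automatically is not much more than this (a hand-rolled sketch; a real authoring tool would use a proper RDF library and its own metadata fields):

        # Sketch: the kind of Dublin Core RDF block an authoring tool could
        # drop into an HTML comment in every page it saves. Values are
        # placeholders supplied by the tool's own metadata dialog.
        def dublin_core_comment(title, author, url):
            return """<!--
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
          <rdf:Description rdf:about="%s">
            <dc:title>%s</dc:title>
            <dc:creator>%s</dc:creator>
          </rdf:Description>
        </rdf:RDF>
        -->""" % (url, title, author)

        print(dublin_core_comment("Holiday snaps", "J. Doe", "http://example.org/snaps.html"))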

    Note however that this isn't a problem caused by the XML RDF serialisation format.

    "The problem is, nobody's bothered to work on a tool to make tuple-spidering code to generate tuples for RDF."

    Honestly, what's the point? It would be much more productive to refit authoring and content-management tools so they produce RDF, and search engines and the like so they consume it. We'd be much better off.
  • by crazyphilman ( 609923 ) on Saturday January 15, 2005 @02:24AM (#11371671) Journal
    I think the "Semantic Web" sounds great on paper, and is the next big thing in university research departments, etc., etc., BUT I don't think it's going to end up seeing wide use. Here are my reasons, basically a list of the things that I, as a web developer, would hesitate over.

    1. The Semantic web seems to require a lot of extra complexity without much "bang for my buck". If I build a page normally, all my needs are already met. I can submit the main web page to search engines, prevent the rest from being indexed, figure out how to advertise my page's existence... I'm pretty much set. The extra stuff doesn't buy me anything. In fact, I definitely would NOT want people to be able to find information on my site without going through my standard user interface. I WANT them to come in through the front door and ask for it.

    2. Let's say people start using this tech, which I imagine would involve all sorts of extra tagging in pages, extra metadata, etc. Now you have to trust people to A) actually know what they're doing and set things up properly, which is a long shot at best, and B) not try to game the system somehow. On top of that, you have to trust the tool vendors to write bug-free code, which isn't going to happen. What I'm saying is that all these extra layers of complexity are places for bugs, screw-ups, and booby traps to hide.

    3. And, the real beneficiary of these sorts of systems seems to be the tool vendors themselves. Because what this REALLY seems to be about is software vendors figuring out a new thing they can charge money for. Don't write those web pages using HTML, XML, and such! No, code them up with our special sauce, and use our special toolset to bake them into buttery goodness! Suddenly, you're not just writing HTML, you're going through a whole development process for the simplest of web pages.

    Maybe I'm getting crusty in my old age, but it seems that every single year, some guy comes up with some new layer of complexity that we all "must have". It's never enough for a technology to simply work with no muss and no fuss. Nothing must ever be left alone! We must change everything every year or two! Because otherwise, what would college kids do with their excess energy, eh?

    Sigh... Anyway, no matter what you try and do to prevent the Semantic Web from turning out just like meta tags, the inevitable will happen. You watch.

  • by l0b0 ( 803611 ) on Saturday January 15, 2005 @05:14AM (#11372093) Homepage
    The Semantic web seems to require a lot of extra complexity without much "bang for my buck". If I build a page normally, all my needs are already met.

    How about the needs of the people actually using the page? If you don't care about the viewers, why bother putting it on the web?

    I definitely would NOT want people being able to find information on my site without going through my standard user interface. I WANT them to come in through the front door and ask for it.

    That sounds just like the kind of site that pisses me off: one that redirects me to the main page after I've found the page I really want via Google. Forcing visitors to jump through hoops has never been popular.

    Now you have to trust people to A) actually know what they're doing and set things up properly, which is a long shot at best, and B) not try to game the system somehow.

    As a web developer, you probably already know what kinds of ugly designs there are out there. And yet, by some kind of magic, there are companies which create searchable indexes of these pages, and it just works [google.com]. One of the benefits of this technology that I expect to see in search engines shortly is the possibility of semantic searches. How would you go about, today, looking for a bike magazine called "Encyclopedia" (I've tried)? Or research results relevant to your latest blog entry? Or the cheapest direct or indirect first-class return ticket from London to New Delhi, departing between one hour from now and 9 a.m. on Thursday, with the return between three and five days later, non-smoking all the way?
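
    For what it's worth, once that kind of information is typed metadata rather than free text, the magazine example becomes a structured query. A sketch with rdflib (the data and the choice of vocabulary are invented for illustration):

        # Sketch: "a bike magazine called Encyclopedia" as a graph query
        # instead of a keyword search. Resource URI and data are made up.
        from rdflib import Graph, Literal, Namespace, URIRef

        DC = Namespace("http://purl.org/dc/elements/1.1/")
        g = Graph()

        mag = URIRef("http://example.org/mags/encyclopedia")  # hypothetical
        g.add((mag, DC["title"], Literal("Encyclopedia")))
        g.add((mag, DC["subject"], Literal("cycling")))

        # Everything titled "Encyclopedia" whose subject is cycling.
        for s, _, _ in g.triples((None, DC["title"], Literal("Encyclopedia"))):
            if (s, DC["subject"], Literal("cycling")) in g:
                print(s)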
