The Internet Businesses Google

On Finding Semantic Web Documents 67

Posted by michael
from the location-location-location dept.
Anonymous Coward writes "A research group at the University of Maryland has published a blog describing the latest approach for finding and indexing Semantic Web Documents. They have published it in reaction to Peter Norvig's (director of search quality at Google) view on the Semantic Web (Semantic Web Ontologies: What Works and What Doesn't): 'A friend of mine [from UMBC] just asked can I send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple other extensions on them; he couldn't find them all. I looked, and it turns out there's only around 200,000 of them. That's about 0.005% of the web. We've got a ways to go.'"
This discussion has been archived. No new comments can be posted.

  • by Simon Brooke (45012) * <stillyet@googlemail.com> on Friday January 14, 2005 @05:55PM (#11368557) Homepage Journal

    It's not about the filename extension (if any), silly. It's about the data. Valid RDF data may be stored in files with a wide range of extensions, or even (how radical is this?) generated on the fly.

    What matters is first the mime type (which is most likely application/xml or preferably text/xml), and the data in it.

    Oh, and, First Post, BTW.

    • Preferably application/rdf+xml. Anything else is not appropriate for RDF-serialized triples. text/xml and application/xml are both wrong for this kind of data.

      This will become more important as resources are represented in multiple ways, for tools to consume: they ask for a specific type, and fallback may fall the wrong way if people start telling their webservers that RDF is something it's not.
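
A minimal sketch of the distinction drawn in this exchange, assuming a crawler that has a URL and a Content-Type header for each response. The function name, the return labels, and the extension list are illustrative assumptions, not anything a real crawler is known to use.

```python
from urllib.parse import urlparse

# Media type registered for RDF/XML (RFC 3870).
RDF_XML = "application/rdf+xml"
# Generic XML types: the body might be RDF, but the header alone cannot say.
GENERIC_XML = {"application/xml", "text/xml"}
# Extensions mentioned in the story and thread; illustrative only.
RDF_EXTENSIONS = (".rdf", ".owl", ".rdfs", ".daml")


def classify(url: str, content_type: str) -> str:
    """Return 'rdf', 'maybe-rdf' or 'unknown' for one crawled response."""
    # Drop parameters such as '; charset=utf-8' and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == RDF_XML:
        return "rdf"
    if media_type in GENERIC_XML:
        return "maybe-rdf"  # the body has to be inspected
    path = urlparse(url).path.lower()
    if path.endswith(RDF_EXTENSIONS):
        return "maybe-rdf"  # an extension is only a hint
    return "unknown"
```

Under these assumptions, a document served as application/rdf+xml is recognised whatever its filename, while text/xml only earns a "maybe" until the data itself is read.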
    • I work at one of the few places that crawls billions of URLs each month, and I observed exactly the same thing as Peter. There just isn't that much xml/rdf/daml/owl on the web. At the point when we had crawled 6 billion URLs, I found only 180,000 URLs that had a mime type or extension to indicate that they were machine-readable metadata.

      The reason is something that people in the semantic web community are loath to talk about - that there isn't enough incentive for people to create metadata that they put
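
For scale, the two sets of figures quoted so far can be put side by side. This is only arithmetic on the numbers as stated; the total web size is never given in the story, so the value computed here is merely what Norvig's two numbers imply.

```python
from fractions import Fraction


def percent_share(part: int, whole: int) -> Fraction:
    """Share of `whole` taken up by `part`, as an exact percentage."""
    return Fraction(part, whole) * 100


# Crawl figures from the comment above: 180,000 hits in 6 billion URLs.
crawl_share = percent_share(180_000, 6_000_000_000)

# Norvig's figures from the story: 200,000 files, said to be 0.005% of the web.
# This is the web size those two numbers imply, not a stated figure.
implied_web_size = 200_000 / (Fraction("0.005") / 100)

print(float(crawl_share))     # 0.003 (percent)
print(int(implied_web_size))  # 4000000000
```

Both counts put machine-readable metadata in the thousandths of a percent of crawled URLs.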
  • What about... (Score:4, Insightful)

    by Apreche (239272) on Friday January 14, 2005 @05:55PM (#11368567) Homepage Journal
    What about all the pages that are .rss but are actually RSS 1.0? Those are RDF-based. And what about all the RDF which is in the comments of .html files and others? My Creative Commons license is RDF, but it's inside a .html file. Sure, we do have a long way to go, but the semantic web is bigger than a few file extensions findable by Google.
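
A rough sketch of how RDF tucked inside HTML comments could be found, assuming the metadata appears as a literal rdf:RDF element within a comment. The sample page is made up, and the regular expressions are a shortcut: a real crawler would more likely use an HTML parser.

```python
import re

# Matches an <rdf:RDF ...>...</rdf:RDF> element; DOTALL lets it span lines.
RDF_BLOCK = re.compile(r"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)
# Captures the body of an HTML comment.
HTML_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def rdf_in_comments(html: str) -> list[str]:
    """Return every RDF/XML block found inside an HTML comment."""
    blocks = []
    for comment in HTML_COMMENT.findall(html):
        blocks.extend(RDF_BLOCK.findall(comment))
    return blocks


# Made-up page for illustration.
SAMPLE = """<html><body>
<p>My weblog</p>
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://web.resource.org/cc/">
  <Work rdf:about="">
    <license rdf:resource="http://creativecommons.org/licenses/by/2.0/"/>
  </Work>
</rdf:RDF>
-->
</body></html>"""

found = rdf_in_comments(SAMPLE)
print(len(found))  # 1
```

A search restricted to .rdf or .owl filenames would never see this block, since the file it lives in ends in .html.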
  • unexpected? (Score:3, Insightful)

    by AnonymousCactus (810364) on Friday January 14, 2005 @06:02PM (#11368635)
    Without a large number of widely used tools out there that make use of semantic information there won't be that much content designed for them...and without content designed for them the tools won't exist and certainly won't be widely used. Currently it's more of an academic exercise - if we somehow knew what all this information on the web actually was, what could we do with it? More interesting, it seems, are approaches that bypass marking things up by hand and do something equivalent automatically.
  • by faust2097 (137829) on Friday January 14, 2005 @06:02PM (#11368636)
    Semantic web stuff is cool and all but I honestly don't believe that it will ever really take off in any meaningful way. For one, it takes a paradigm that people know and understand and adds a lot of complexity to it, both on the user end and the engineering end.

    Plus a lot of the rah-rah booster club that's grown up around it sound a whole lot like the Royal Society folks in Quicksilver who keep trying to catalog everything in the world into a 'natural' organization.

    What it basically comes down to for me is that it seems like a great framework for single-topic information organization but at a point we need to keep our focus on the actual content of what we're producing more than the packaging. For this to be ready for prime time the value proposition needs to move from a 30-minute explanation involving diagrams and made-up words ending in '-sphere' to something even less than an "elevator pitch" like 2 sentences.
    • Two sentences, eh? (Score:3, Interesting)

      by misuba (139520)
      You're on.

      1) A simple human- and machine-readable schema is defined for marking up descriptions of items for sale or wanted.
      2) Google learns how to read them, thereby putting eBay, Craigslist, and other sundry companies out of business and putting your data back in your hands.

      Okay, so the second sentence is a bit of a run-on, and this use case has a whole lot of hairy details I'm leaving out. But the possibilities are pretty exciting nonetheless.
      • The problem is, that doesn't require the semantic web or any sort of semantic technologies.

        A simple well-formed XML document will suffice and be simpler to write. And if you *really* want to make it fit into the semantic web, you can provide an XSLT file that translates it to RDF. The problem is that RDF is wounded by having an incredibly ugly syntax.

        Furthermore, the simple model of posting XML or RDF documents and google not only magically finding them but "putting the data back in your hands" is flawe
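
To make the "plain XML first, RDF later" idea concrete, here is a small sketch in which Python stands in for the XSLT step mentioned above. The listing format, the `id` attribute, and the vocabulary URI are all hypothetical; no such standard is implied.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing format and vocabulary; neither is a real standard.
LISTING = """<listing id="http://example.org/items/42">
  <title>Used bicycle</title>
  <price>75</price>
  <status>for-sale</status>
</listing>"""
VOCAB = "http://example.org/vocab#"


def listing_to_triples(xml_text: str) -> list[tuple[str, str, str]]:
    """Turn each child element into a (subject, predicate, object) triple."""
    root = ET.fromstring(xml_text)
    subject = root.attrib["id"]
    return [(subject, VOCAB + child.tag, (child.text or "").strip())
            for child in root]


for s, p, o in listing_to_triples(LISTING):
    # N-Triples style output; literal escaping is deliberately left out.
    print(f'<{s}> <{p}> "{o}" .')
```

The mapping is mechanical, which supports the commenter's point: the hard part is agreeing on the vocabulary, not producing the triples.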
        • by mike_sucks (55259)
          "The problem is, that doesn't require the semantic web or any sort of semantic technologies."

          You're right of course, but for any such initiative to be successful, it needs to use a standard (or at least widely-known and stable) format/grammar/etc so that third-party systems can understand your data.

          This is where RDF, OWL and the other semantic web technologies come into it. Why invent another system when there is already one there?

          "The problem is that RDF is wounded by having an incredibly ugly syntax."

          N
          • The problem with RDF is that they are putting really technical terms on really simple ideas and nobody's done an especially good job of distilling it down to the most basic level in such a way that anybody can program it. It's not just that the format's ugly, it's that, as far as I can tell, the vast majority of folks who actually are in a position to output semantic information suffer eye-glaze-over when they try to understand RDF.

            Furthermore, the problem is not just that you need to have the tools outpu
            • "as far as I can tell, the vast majority of folks who actually are in a position to output semantic information suffer eye-glaze-over when they try to understand RDF."

              That's interesting, do you have some figures to back that up? I'm a generic web developer from a small city in a backwards country, and I get it. Are you saying that RDF is harder to learn than, say, writing a POSIX or Windows application? An FPS? A Java web-app? A Linux kernel module? Because plenty of people do those things, every day.

              "Furth
              • *one* RDF-based technology (And only for part of the market) in the past several years means that we'll start having real semantic applications in 2105 or so.

                Remember, there's nothing intelligent about syndication. It fits just as well into the "well formed web" as it does the "semantic web". All RDF does is make things much more verbose than they would otherwise be. The whole point of the semantic web was so that I could view an RDF Site Summary file and have my web browser automatically figure it out and link
                • "Remember, there's nothing intelligent about syndication. It fits just as well into the "well formed web" as it does the "semantic web"."

                  Sure, syndication is a specialization of the semantic web in general, but when you come down to it, that's all the semantic web is: machine-readable information about a web-accessible resource. There's nothing special or crazy or obtuse going on, it's that simple.

                  "But then, your average web designer isn't thinking about RDF, so the message is either getting lost or ignor
      • But I like craigslist better than I like Google. And the quality of that data in my hands is dependent upon the internet community at large AKA those people who write gay erotic fan fiction about Star Trek characters.

        Besides, Google's pagerank has been owned by 'optimizers' for years. I don't trust them any more than any other commercial enterprise.

        And FOAF is the closest the 'blogosphere' has come yet to physically jerking each other off.
      • 1) A simple human- and machine-readable schema is defined for marking up descriptions of items for sale or wanted.

        How does a typical shop keeper learn to do this and apply it to their wizard-made web page?

        The second point about Google is fairly ripe for abuse. Meta tags in HTML were mostly rendered useless by porn sites, for example. Also, sites like eBay tend to concentrate useful information in useful ways, while feeding keywords to Google can often be frustrating for anything remotely generic.
  • Norton Antivirus got to do with this web technology?
  • by crschmidt (659859) on Friday January 14, 2005 @06:16PM (#11368769) Homepage Journal
    Every user of a LiveJournal-based website running recent code has a FOAF file. Let's look how many users that is:

    * LiveJournal.com: 5751567
    * GreatestJournal.com: 717406
    * DeadJournal.com: 474435
    * Weedweb.net: 22650
    * InsaneJournal.com: 12970
    * JournalFen.net: 7629
    * Plogs.net: 7086
    * journal.bad.lv: 4530

    (This list is most likely incomplete.)

    In addition to this, every Typepad user has an account: according to the 6A merger stories, that's another million users. Add in the RDF from all the Typepad RSS files, and that's another 1 million.

    All Wordpress blogs have a feed, located at /feed/rdf or /wp-rdf.php, which is in RDF. Movable Type comes preinstalled with an RSS 1.0 feed. Each of these has at least a couple thousand users.

    So, we've got, just as a guess, about 9 million RDF files out there in the blogging world alone. Throw in a hell of a lot of scientific data, and everything on RDFdata.org [rdfdata.org], and you start to get an idea that the world is a lot more Semantic Web enabled than you seem to think it is.
    • About 75% of those who signed up for those various blogging services have never actually posted a single entry in their blog. So the actual number is more like 2.2 million or so. Even with a devastating hit like that it's still 10 times more than the number stated in the article though....lol...and it's still just the bloggers alone.
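
The back-of-the-envelope numbers in this sub-thread can be checked directly. The user counts are copied from the list above; applying the 75% discount uniformly to every service, Typepad included, is an assumption made here for simplicity, and WordPress and Movable Type are left out because only "a couple thousand" is given for them.

```python
# User counts as listed in the parent comment (January 2005).
livejournal_sites = {
    "LiveJournal.com": 5_751_567,
    "GreatestJournal.com": 717_406,
    "DeadJournal.com": 474_435,
    "Weedweb.net": 22_650,
    "InsaneJournal.com": 12_970,
    "JournalFen.net": 7_629,
    "Plogs.net": 7_086,
    "journal.bad.lv": 4_530,
}
typepad_foaf = 1_000_000  # "another million users"
typepad_rss = 1_000_000   # "another 1 million" from Typepad RSS files

foaf_total = sum(livejournal_sites.values())
estimate = foaf_total + typepad_foaf + typepad_rss

# The reply's claim: about 75% of accounts never posted (applied uniformly).
active = estimate * 25 // 100
# Compared with the 200,000 files Norvig reported in the story.
ratio = active / 200_000

print(foaf_total)       # 6998273
print(estimate)         # 8998273
print(active)           # 2249568
print(round(ratio, 1))  # 11.2
```

So "about 9 million" and "2.2 million or so" both hold up as rounded figures, and the discounted count is still roughly eleven times the number in the story.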
    • So, we've got, just as a guess, about 9 million RDF files out there in the blogging world alone.

      Care to venture a guess as to how many of those actually contain useful information? Really, who cares if Melanie in Oshkosh really, really loves Justin Timberlake, or Winthorpe in Des Moines really, really wants people to sign up so he can get an iPod?

      Furthermore, once you start tying all this information together, doesn't that just make the work for corporate data miners just that much easier?

      Of course, you
  • A few sites I have worked on that are run by MKDoc [mkdoc.org] are listed in their top 500 [umbc.edu], since MKDoc generates an RDF metadata file for every HTML document, but the biggest and most interesting are missing. I expect that there are perhaps several hundred times more RDF documents out there than they have found...

  • From the article: For each new site S we encounter, we give Google the narrower query 'filetype:owl site:S', which will often end up getting some additional results not included in the earlier query.

    From the Google TOS: [google.com] You may not send automated queries of any sort to Google's system without express permission in advance from Google.

    I am serious. These researchers just used a lot of resources from Google that they had no permission to use. Researchers especially should try to be good citizens on the n
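
For readers wondering what the narrowing step quoted from the article amounts to, here is a sketch that only builds the query strings. It sends nothing anywhere; the function name, the default file types beyond "owl", and the host names are assumptions for illustration.

```python
def narrowed_queries(sites, filetypes=("owl", "rdf")):
    """Build one 'filetype:X site:S' query string per site and file type."""
    return [f"filetype:{ft} site:{site}" for site in sites for ft in filetypes]


# Hypothetical hosts; no request is sent anywhere.
queries = narrowed_queries(["example.org", "example.edu"], filetypes=("owl",))
print(queries)
# ['filetype:owl site:example.org', 'filetype:owl site:example.edu']
```

Each newly discovered site adds one query per file type, which is why the volume of automated queries is the point of contention in this sub-thread.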

    • by Anonymous Coward
      I happen to know the members of the ebiquity lab personally; I was a grad student in another lab at UMBC. Just to clarify what the person who wrote this post completely made up, the ebiquity people actually had an account with Google to use the web APIs for use in Swoogle. I know this because at the time they were working on it, Google was having some sort of technical problem and they couldn't register for a license key. So they asked around if anyone had a valid key that they weren't using and that ebiqui
      • I must have been misled by statements such as:
        We never did get any help from Google. -- Tim Finin [umbc.edu]
        and statements like: Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted. -- Google.

        You may be right; they may have had permission. But all I see are complaints about Google not cooperating, and I would count special permission to use their database as cooperation.

        At any rate, if I misunderstood Tim Finin, then I do a
        • Perhaps I should have clarified how Google didn't help. I had asked Peter Norvig if we might be able to get all of the .owl, .rdf, .rdfs, etc files. He said he'd check, but we never heard back. This was during the early stage of the runup to the IPO, so I was neither upset nor surprised. If Google started helping all of the web hackers in the world, they'd never get anything done. Besides, it gave us a new problem to solve and left us with a nice warm feeling after we did. I'm pretty sure our use of the
    • The Google TOS [google.com] you are talking about is for the Google website. We used the Google web service API; please read the Google API TOS [google.com].
      The Google API was built to allow automated queries, so we were not "violating" the TOS.

      So I think it is wrong on your part to comment on someone without having the full information. Of course it may take longer and require more work, but that seems better than using wrong information.
    • I would call that stealing, except I won't because that will start a whole other thread telling me that information cannot be stolen.

      Too late! In that case Google is the bigger thief with all the cached content it has stored and provides through its own servers.
    • Nah. This would be a valid example of Fair-Use, that everyone here likes to bandy about.
  • Apart from RSS feeds, how can I use this data? I mean, I have RDF metadata available for pretty much every page on my website, but I haven't yet noticed anyone who actually reads it.

    The semantic web seems like a good idea in principle, but I would really like to know just how I could use it in real life! Seriously, can anyone name a useful tool that relies on RDF feeds (again, aside from RSS-style stuff) or propose one that could? Perhaps if I saw a real application of the semantic web I would actually und
    • Check out http://crschmidt.net/semweb/ for info on some of the projects I've worked on which use the semantic web.

      The most interesting one, in my opinion, is lorebot [crschmidt.net]. Lorebot sits in a channel, and associates identified users to their FOAF files. Once it does this, it links them to a human-readable HTML description of them, and, if possible, displays an image for them. Example output: online users [crschmidt.net], personal output [crschmidt.net].

      There are also things like the FOAF-a-Matic or DOAP-a-Matic, both of which take RDF and spit out
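
As a rough idea of what reading a FOAF file involves, the sketch below pulls a name and an image URL out of a tiny RDF/XML document. It is not lorebot's code; the document and the person in it are made up, and ElementTree only copes with this one XML shape, where a real RDF parser would handle other serializations too.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Made-up FOAF document for a made-up person.
FOAF_DOC = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Alice Example</foaf:name>
    <foaf:img rdf:resource="http://example.org/alice.png"/>
  </foaf:Person>
</rdf:RDF>"""


def describe_people(foaf_xml: str) -> list[dict]:
    """Return the name and image URL of each foaf:Person in the document."""
    root = ET.fromstring(foaf_xml)
    people = []
    for person in root.iter("{%s}Person" % NS["foaf"]):
        name = person.findtext("foaf:name", default="", namespaces=NS)
        img = person.find("foaf:img", NS)
        people.append({
            "name": name,
            # foaf:img points at the picture through rdf:resource.
            "image": img.get(RDF_RESOURCE) if img is not None else None,
        })
    return people


print(describe_people(FOAF_DOC))
```

A bot that has matched a user to a FOAF URL could render a short profile from exactly these two fields.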
  • by crazyphilman (609923) on Saturday January 15, 2005 @01:24AM (#11371671) Journal
    I think the "Semantic Web" sounds great on paper, and is the next big thing in university research departments, etc., etc., BUT I don't think it's going to end up seeing wide use. Here are my reasons, basically a list of things that I as a web developer would hesitate on.

    1. The Semantic web seems to require a lot of extra complexity without much "bang for my buck". If I build a page normally, all my needs are already met. I can submit the main web page to search engines, prevent the rest from being indexed, figure out how to advertise my page's existence... I'm pretty much set. The extra stuff doesn't buy me anything. In fact, I definitely would NOT want people being able to find information on my site without going through my standard user interface. I WANT them to come in through the front door and ask for it.

    2. Let's say people start using this tech, which I imagine would involve all sorts of extra tagging in pages, extra metadata, etc. Now you have to trust people to A) actually know what they're doing and set things up properly, which is a long shot at best, and B) not try to game the system somehow. On top of that, you have to trust the tool vendors to write bug-free code, which isn't going to happen. What I'm saying is that all these extra layers of complexity are places for bugs, screw-ups, and booby traps to hide.

    3. And, the real beneficiary of these sorts of systems seems to be the tool vendors themselves. Because what this REALLY seems to be about is software vendors figuring out a new thing they can charge money for. Don't write those web pages using HTML, XML, and such! No, code them up with our special sauce, and use our special toolset to bake them into buttery goodness! Suddenly, you're not just writing HTML, you're going through a whole development process for the simplest of web pages.

    Maybe I'm getting crusty in my old age, but it seems that every single year, some guy comes up with some new layer of complexity that we all "must have". It's never enough for a technology to simply work with no muss and no fuss. Nothing must ever be left alone! We must change everything every year or two! Because otherwise, what would college kids do with their excess energy, eh?

    Sigh... Anyway, no matter what you try and do to prevent the Semantic Web from turning out just like meta tags, the inevitable will happen. You watch.

    • by l0b0 (803611)

      The Semantic web seems to require a lot of extra complexity without much "bang for my buck". If I build a page normally, all my needs are already met.

      How about the needs of the people actually using the page? If you don't care about the viewers, why bother putting it on the web?

      I definitely would NOT want people being able to find information on my site without going through my standard user interface. I WANT them to come in through the front door and ask for it.

      That sounds just like the kind of site I g

      • "How about the needs of the people actually using the page? If you don't care about the viewers, why bother putting it on the web?"

        Anything more complex than flat HTML is actually going to require the developer to retain some control over how the user views the pages. For example, take a page that allows you to submit an application online. The only appropriate place for a user to start is the start page of the application. NATURALLY I'm going to bounce you back to the beginning.

        Anytime you try to do ANYT

