Supercomputing The Internet

Collaborative Map-Reduce In the Browser 188

Posted by kdawson
from the supercomputer-on-the-very-cheap dept.
igrigorik writes "The generality and simplicity of Google's Map-Reduce is what makes it such a powerful tool. However, what if instead of using proprietary protocols we could crowd-source the CPU power of millions of users online every day? Javascript is the most widely deployed language — every browser can run it — and we could use it to push the job to the client. Then, all we would need is a browser and an HTTP server to power our self-assembling supercomputer (proof of concept + code). Imagine if all it took to join a compute job was to open a URL."
This discussion has been archived. No new comments can be posted.

  • Random Thoughts (Score:5, Interesting)

    by AKAImBatman (238306) * <akaimbatman AT gmail DOT com> on Tuesday March 03, 2009 @03:51PM (#27056153) Homepage Journal

    Two comments:

    1. He places the map/emit/reduce functions in the page itself. This is unnecessary. Since Javascript can easily be passed around in text form, the packet that initializes the job can pass a map/emit/reduce function to run. e.g.:

    var myfunc = eval("(function() { /*do stuff*/ })");

    In fact, the entire architecture would work more smoothly using AJAX with either JSON or XML rather than passing the data around as HTML content. As a bonus, new types of jobs can be injected into the compute cluster at any time.
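
    A minimal sketch of that idea: the job arrives as JSON text carrying the map and reduce functions as source strings, and the client compiles them with eval(). The packet field names (id, map, reduce, data) and the helpers compile() and runJob() are made up for illustration and are not taken from the proof of concept.

```javascript
// Hypothetical job packet, as a server might send it over AJAX as JSON.
// The field names (id, map, reduce, data) are illustrative assumptions.
const packetText = JSON.stringify({
  id: "wordcount-1",
  map: "(function (line, emit) { line.split(/\\s+/).filter(Boolean).forEach(function (w) { emit(w, 1); }); })",
  reduce: "(function (key, values) { return values.reduce(function (a, b) { return a + b; }, 0); })",
  data: ["a b a", "b c"]
});

// Turn function source text into a callable, as in the eval() one-liner above.
// eval() runs arbitrary code, so this is only reasonable for a trusted server.
function compile(source) {
  return eval(source);
}

function runJob(text) {
  const job = JSON.parse(text);
  const map = compile(job.map);
  const reduce = compile(job.reduce);

  // Group emitted values by key.
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  job.data.forEach((record) => map(record, emit));

  // Reduce each group to a single value.
  const result = Object.create(null);
  for (const [key, values] of groups) result[key] = reduce(key, values);
  return { id: job.id, result };
}

const jobOutput = runJob(packetText);
console.log(jobOutput.result);
```

    Because the functions travel with the data, a different kind of job only needs a different packet, not a different page.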

    2. Both Gears and HTML5 have background threads for this sort of thing. Since abusing the primary thread tends to lock the browser, it's much better to make use of one of these facilities whenever possible. Especially since multithreading appears to be well supported by the next batch of browser releases [owensperformance.com].

    (As an aside, I realize this is just a proof of concept. I'm merely adding my 2 cents worth on a realistic implementation. ;-))

    • by Briden (1003105)
      best comment on TFA:

      I think this approach to MapReduce is a pretty creative angle to take on it. However, there are a number of distributed systems-type problems with doing it this way, that would need to be solved to actually make this realistically possible:

      1) The dataset size is currently limited by the web server's disk size.
      Possible solution: push the data to S3 or some other large store.

      2) There is a single bottleneck/point-of-failure in the web server. In theory 10,000 clients could try to emit their map keys all at once to the web server. IIRC, Google's mapreduce elects nodes in the cluster to act as receivers for map keys during the map/sort phase.
      Possible solution: Again, if you were using S3, you could assign them temporary tokens to push their data to S3 -- but that would be a large number of S3 PUT requests (one per key).

      3) Fault-tolerance -- what happens when a node in the browser compute cluster fails for any of N reasons? How does the web server re-assign that map task? You'd especially want to ensure that computation finishes on a job in an unknown environment such as 1,000,000 random machines on the internet.
      Possible solution: If you haven't heard from a node in N seconds, you could reassign their map task to someone else. This is a similar idea to the MapReduce paper's description of sending multiple machines on a single map task, and racing them to the finish.

      4) Security -- there is no way to deterministically know whether the data emit()ed from a user's browser session is real or not. How do you trust the output of 1,000,000 users' Javascript browser executions (I think the answer is, you don't).
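
      The timeout idea in point 3 could look roughly like this on the server side. This is a sketch under assumptions, not how the proof of concept works: the TaskTable class, its method names, and the injected clock are all hypothetical.

```javascript
// Hypothetical in-memory task table for the server side.
// timeoutMs and the injected clock (now) are assumptions for illustration.
class TaskTable {
  constructor(taskIds, timeoutMs, now = () => Date.now()) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.tasks = new Map();
    for (const id of taskIds) {
      this.tasks.set(id, { state: "pending", worker: null, leasedAt: 0 });
    }
  }

  // Hand out a pending task, or one whose lease has expired.
  assign(workerId) {
    const t = this.now();
    for (const [id, task] of this.tasks) {
      const expired = task.state === "leased" && t - task.leasedAt >= this.timeoutMs;
      if (task.state === "pending" || expired) {
        this.tasks.set(id, { state: "leased", worker: workerId, leasedAt: t });
        return id;
      }
    }
    return null; // nothing to hand out right now
  }

  // Accept the first result for a task; later duplicates are ignored.
  complete(taskId, result) {
    const task = this.tasks.get(taskId);
    if (!task || task.state === "done") return false;
    this.tasks.set(taskId, { state: "done", worker: task.worker, result });
    return true;
  }
}

// Usage with a fake clock so the timeout can be stepped by hand.
let clock = 0;
const table = new TaskTable(["t1", "t2"], 5000, () => clock);
const first = table.assign("browser-A");  // "t1"
const second = table.assign("browser-B"); // "t2"
const none = table.assign("browser-C");   // null, both tasks are leased
clock = 6000;                             // browser-A has gone quiet
const retry = table.assign("browser-C");  // "t1" again
```

      Ignoring late duplicates in complete() means a slow browser and its replacement can race, and whichever reports first wins.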

      • by AKAImBatman (238306) * <akaimbatman AT gmail DOT com> on Tuesday March 03, 2009 @05:08PM (#27057139) Homepage Journal

        Further down in the Slashdot comments, a poster also pointed out that Javascript is a poor platform for computationally intensive work. Which I agree with on a general level. The Javascript number system is designed for genericity, not performance.

        In the end this is just a cute idea that has any number of practical problems. Many of them reflect the fact that distributed computing is hard, but many of them also reflect the fact that the suggested platform is less than ideal for this function. Especially if you're going to be pushing workloads that take more time and resources to transmit back and forth than to simply compute them.

        Doesn't stop me from humoring him, though. We all have to dream. ;-)

        And besides, this may just inspire the next fellow down the line to use the technology for a more practical purpose.

        • man, this idea is sooooooooooooooo old.... also, this example from 2005 comes with performance tests. http://blog.mininova.org/articles/2005/11/17/mininova-the-javascript-cluster/ [mininova.org]
        • by chrish (4714)

          Unity [unity3d.com] (a game development platform) translates JavaScript (and Python) into .NET CLR opcodes and then runs them via Mono, which ends up being quite a bit faster than just running the JavaScript in a traditional interpreter.

          SquirrelFish (a bytecode interpreter in WebKit) and V8 (a native JIT compiler in Chrome) are also available to speed things up.

          Looking at it from the .NET CLR bytecode perspective, Silverlight 2.0 is available on Windows platforms and OS X (Intel); once Moonlight hits 2.0, that'll make a

      • by fractoid (1076465)
        Problems 1 and 2 are based on the faulty assumption that "a web site" is a single piece of hardware, whereas for a system like this it would obviously be a server farm connected to the intarwebz by several high-volume dedicated gateways.

        Problem 3 has to be solved in any real-world implementation, so is pretty obvious (and tractable with timeouts etc) imo.

        Problem 4 is the tough one - you can either slug your performance hard by running calculations multiple times, or you figure out some way to authentica
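
        The "run calculations multiple times" option can be sketched as a simple quorum check: hand the same task to several browsers and accept a value only once enough of them agree. The function name acceptResult and the quorum parameter are hypothetical, and this does nothing against workers that collude on the same wrong answer.

```javascript
// Return the value reported by at least `quorum` workers, or null if no
// value has that much agreement yet. Results are compared by their JSON text,
// which assumes workers serialize results identically.
function acceptResult(reports, quorum) {
  const counts = new Map();
  for (const report of reports) {
    const key = JSON.stringify(report);
    const n = (counts.get(key) || 0) + 1;
    counts.set(key, n);
    if (n >= quorum) return report;
  }
  return null;
}

// Three browsers ran the same task; one of them lied.
const agreed = acceptResult([{ a: 2 }, { a: 999 }, { a: 2 }], 2);
// Only two reports so far, and they disagree.
const undecided = acceptResult([{ a: 2 }, { a: 999 }], 2);
```

        The cost is the performance hit mentioned above: a quorum of 2 at least doubles the work per task.
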
    • AJAX + self updating js saved in cookies

      • AJAX + self updating js saved in cookies

        Local Storage APIs would probably work better. The entire data set could even be dumped to local storage to allow recovery from browser failures. In addition, using the SQL engine of the Local Storage database can speed up certain sorting and aggregation tasks, thus (potentially) allowing for a faster response than making Javascript do all the heavy lifting.
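
        A rough sketch of the recovery idea, using only a key/value store with the getItem/setItem shape of Web Storage (the SQL-engine variant is not shown). The names memoryStorage, processWithCheckpoint, and the "checkpoint:" key prefix are assumptions for illustration.

```javascript
// Minimal in-memory stand-in with the getItem/setItem shape of Web Storage,
// so the sketch also runs outside a browser. In a page you would pass
// window.localStorage instead.
function memoryStorage() {
  const items = new Map();
  return {
    getItem: (k) => (items.has(k) ? items.get(k) : null),
    setItem: (k, v) => { items.set(k, String(v)); }
  };
}

// Process records from wherever the last checkpoint left off.
// `limit` caps how many records are handled in this call, which lets the
// demo simulate a crash part-way through.
function processWithCheckpoint(storage, jobId, records, handle, limit = Infinity) {
  const key = "checkpoint:" + jobId; // hypothetical key naming scheme
  const saved = JSON.parse(storage.getItem(key) || '{"next":0,"partial":{}}');
  let done = 0;
  while (saved.next < records.length && done < limit) {
    handle(records[saved.next], saved.partial);
    saved.next += 1;
    done += 1;
    storage.setItem(key, JSON.stringify(saved)); // checkpoint after each record
  }
  return saved;
}

const store = memoryStorage();
const records = [3, 4, 5, 6];
const add = (n, partial) => { partial.sum = (partial.sum || 0) + n; };

const beforeCrash = processWithCheckpoint(store, "job-7", records, add, 2); // stops after 2 records
const resumed = processWithCheckpoint(store, "job-7", records, add);        // picks up at record 3
```

        Checkpointing after every record is the simplest choice; a real job would probably batch writes, since storage calls are synchronous.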

  • Botnet (Score:4, Insightful)

    by ultrabot (200914) on Tuesday March 03, 2009 @03:51PM (#27056159)

    Imagine how much *spam* you could send using this approach.

    No, wait...

    • Re:Botnet (Score:5, Insightful)

      by MonoSynth (323007) on Tuesday March 03, 2009 @03:58PM (#27056251) Homepage

      With ever-increasing JavaScript performance, there's a lot of CPU power available for cracking passwords and captchas... Just include the code in an ad and you're done. No tricky installs needed, just the idle time of the user's web browser.

      • by Wee (17189)
        My CPU time isn't idle. It's keeping my laptop from being too hot to touch and too noisy to work on. And there's no reason to pay more for electricity than I already do.

        -B
        • by Chabo (880571)

          I really don't think laptops were designed to run at 100% all the time anyway, so yeah, I'd avoid any distributed computing projects on your computer.

          I run it on my two desktops at home though, and there's barely any difference in my electric bill. Idle vs load for me is about 40W difference -- I could save more by turning off a fairly dim bedside lamp.

          • by Wee (17189)
            Yeah, a laptop is definitely not a Folding@home type of platform.

            I configured my desktop machine at home to suspend when I hit the power button. I only use it for games, so it's never fully powered on throughout the day. My electric usage would definitely go up a bit if it was always powered on running compute-intensive software.

            -B

      • by euxneks (516538)
        This makes me hate ads even more. I am so glad I use adblock plus right now.
      • by merreborn (853723)

        With ever-increasing JavaScript performance, there's a lot of CPU power available for cracking passwords and captchas... Just include the code in an ad and you're done. No tricky installs needed, just the idle time of the user's web browser.

        This is eerily plausible, but I think there's one thing keeping this from becoming a massive problem:

        Anyone running a legitimate site will kick their advertiser to the curb if their ads start sucking down lots of CPU. The only people who'd allow this sort of advertising

  • Join compute cloud (Score:5, Insightful)

    by Imagix (695350) on Tuesday March 03, 2009 @03:51PM (#27056161)
    We already have that. See botnets.
  • BOINC (Score:5, Insightful)

    by Chabo (880571) on Tuesday March 03, 2009 @03:55PM (#27056205) Homepage Journal

    If you were really interested enough to donate your CPU cycles, is it really that much harder to install BOINC, and get a job running?

    Plus then you can run native code instead of having to run in [shudder]Javascript[/shudder].

    • BOINC is quite possibly the single worst bit of software I've ever seen. It's kind of like the team did a detailed study of the best practices for software usability and then did the exact opposite.

      • by Chabo (880571)

        If you're talking about the UI, then I'll agree it needs a bit of work, but then it is still a "nerd project" at this point. With any nerd project, the interface is at the bottom of the TODO list.

        If you're talking about the code, care to explain? I've never looked at it.

        • by D Ninja (825055)

          With any nerd project, the interface is at the bottom of the TODO list.

          Which is exactly why many people won't use many of these tools even if the tool is "better." UI is extremely important for most people to consider using a tool.

    • I don't get what's the big problem people have with "[shudder]JavaScript[/shudder]".

      It's a Turing-complete language, which means it can be used to do anything from simple form validation to ray tracing and neural net simulations.
      With AJAX to handle file interactions, I don't understand the problem that people have with it. What is it that you think JavaScript can't do that 'x' language can?

      I wish people would get over this childish bias and accept that JavaScript is a /real/ language, and not

      • Re: (Score:3, Insightful)

        by Chabo (880571)

        A big thing is the same thing people have against VB: there may not be anything technically wrong with it, but bad programmers are drawn to it because it's easy, so you hardly ever see a good VB program. There's especially nothing wrong with VB now, when writing a program in VB.NET gets you the same result as if you'd written it in C#: you still get CIL code when it's compiled.

        However, Javascript gets used for way too much, and historically it's been a huge browser security issue. Even if you use it respons

        • by Blakey Rat (99501)

          I'd like to see Javascript elsewhere. In the browser, it's limited by that turd known as DOM... imagine what Javascript could do if it had libraries that weren't utter shite. It could easily take over all the tasks done by Lua now, and possibly most of Python and Ruby as well.

          The problem is people get into web development, find out that DOM is crap, then they assume the problem is Javascript and not DOM. JS is fine; DOM would be just as crap if you were working with it in Python.

          • by barzok (26681)

            imagine what Javascript could do if it had libraries that weren't utter shite

            It's certainly not the exact opposite of "utter shite" but JavaScript on Windows via Windows Script Host has lots of libraries immediately available which makes a lot of tasks on Windows (including administration) much easier via the FileSystemObject, WMI, etc.

            Beats the crap out of cobbling things together with BAT files.

  • Noscript (Score:5, Informative)

    by sakdoctor (1087155) on Tuesday March 03, 2009 @03:56PM (#27056221) Homepage

    Progress is running less JavaScript, not more.

    • Re:Noscript (Score:5, Funny)

      by OzPeter (195038) on Tuesday March 03, 2009 @04:03PM (#27056325)
      Sir, I have the '80s on hold on the phone at the moment. They want to know if you want to buy some stuff called .. umm .. hang on .. yes here it is .. "static HTML pages" ..
      • Re:Noscript (Score:5, Insightful)

        by wirelessbuzzers (552513) on Tuesday March 03, 2009 @04:10PM (#27056403)

        Actually it was the '90s, but whatever. The thing is, non-DHTML web pages are actually pretty good for most things... what made those early '90s web pages so awful was no CSS, slow connections, and the fact that people really didn't know how to design for this new medium.

        Probably 99% of the web still shouldn't need Javascript or flash, though pages usually do need to be dynamic on the server side.

        • by OzPeter (195038)
          I know it was the '90s, I was just trying for the double joke of it being an advanced concept/vapourware for the '80s. Won't try that one again.

          But the argument against javascript is one that is countered by your own comment: "the fact that people really didn't know how to design for this new medium".

          Javascript is a tool just like another on the Internet. It can be used for good or evil depending on who writes the program. And as you mentioned, retreating from javascript means going back to a purely

        • Actually it was the '90s, but whatever. The thing is, non-DHTML web pages are actually pretty good for most things... what made those early '90s web pages so awful was no CSS, slow connections, and the fact that people really didn't know how to design for this new medium.

          Sure it's fine when you've got a 2GHz processor and a smack of RAM to compile and run an interpreted language -- with the sole purpose of relatively simple data manipulation, validation, and perhaps some light processing to kick a chunk of data back. But when you are talking about serious data crunching, you want code running natively, not in a locked down little box, like SETI@Home, and optimized for that architecture and platform.

          People think because you can put it on the web, you should. That is, at bes

        • what made the 90s awful was black page backgrounds, blink tags, banner tags, HR tags that dripped blood, and the GeoCities navigation that was no doubt accompanying all the above "features"
        • by gnud (934243)
          Seconded. On public pages I use javascript only for 2 things:
          1. Non-essential visual effects (most of which are part of current web standard drafts)
          2. Slideshows (with a "manual" fallback, of course)
        • Probably 99% of the web still shouldn't need Javascript or flash

          I think that opinion, although quite frequently espoused on slashdot, suffers from a problem of framing current technology around past application models. Technology for technology's sake, such as Web 2.0 using AJAX/Flash, is not a wasteful exercise. Technology doesn't only stem from innovation; a good chunk of innovation stems from technology. The efforts with Web2.0 are leading to furthering the refinement of cloud computing and distributed,

    • Which is to say that Javascript is the future of all computing progress.

      "First they ignore you, then they ridicule you, then they fight you, then you win." --Mahatma Gandhi

      Mwuhahaha! MWHahaha! MUHAHAHAAAAAAaaaaaa--

      *cough* *cough* *wheeze* *sputter* *wheeze*

      *clears throat* 'scuze me!

      MWUHAHAHAAHAHAAAA!!!! :-P

      • Re: (Score:2, Informative)

        by Anonymous Coward

        "First they march you through hundreds of miles of jungle without food or water, then they shoot you, then they disembowel you, then you lose." --Mahatma Gandhi, had Japan won WW2.

    • Yeah, yeah, and Usenet was the ultimate discussion group and everything's been downhill from there, right? And 25x80 column monitors were plenty (who needs proportional fonts?) and color is way overrated, and...

      Why is it that we always need the previous generation that remembers "what it was like before all this newfangled nonsense" to die off before we can make progress?

      Just because you're looking for the web to look like a static newspaper doesn't mean the rest of the world wants the same thing.

      • by pjt33 (739471)

        Just because you're looking for the web to look like a static newspaper doesn't mean the rest of the world wants the same thing.

        There are situations where JavaScript is good, but it simply breaks things like the ability to bookmark your page and then restore it as it was from the bookmark. Then you have the sites which really abuse it: for example, you can't book a flight with Ryanair if you have JS disabled (or a browser which doesn't support it: they don't seem to have come across the concept of degrading gracefully).

      • by localman (111171)

        How come the new kids who come in can't tell the difference between progress and two-steps-forward-two-steps-back? You make a valid point that some people resist change for poor reasons, but I would say an equal or greater problem is people embracing change for poor reasons.

        DHTML is fine when it works, and it's just starting to get there. But I'd say that web usability was at an all-time low between 2000 and 2006 when the new kids thought everything should be dynamic without the slightest understanding of

        • by Pope (17780)

          Haha, his complaint that the 1st column's background colour won't stretch to the height of the 2nd column has been solved for quite some time.

          The main problem associated with pure table layouts came from good ol' Netscape 4's inability to render very complex tables quickly; thankfully, those days are long gone. These days it's about usability and searchability, which complex table layouts kill dead.

          No one is stopping you from doing pure tables, but the solutions are there in CSS that make things so much bet

  • by Anonymous Coward on Tuesday March 03, 2009 @03:56PM (#27056241)

    You could also use this to index the MP3 files on everybody's hard drives, then share the music just by visiting a URL!! ... oh wait...

    • You could also use this to index the MP3 files on everybody's hard drives, then the RIAA could sue everyone just by visiting a URL!! ... oh wait...

      There, fixed that for you.

  • by wirelessbuzzers (552513) on Tuesday March 03, 2009 @04:03PM (#27056333)

    Javascript really isn't suited for this kind of thing, even with worker threads, for two reasons I can think of. First, web clients are transient... they'd have to report back often in case the user clicks away.

    But more importantly, Javascript just isn't a good language for massive computation. It only supports one kind of number (double), has no vectorization or multicore capabilities, has no unboxed arrays, and even for basically scalar code is some 40x slower than C, let alone optimized ASM compute kernels. (This is for crypto on Google Chrome. Other browsers are considerably slower on this benchmark. YMMV.)
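
    For readers unfamiliar with the "one kind of number" point, a short illustration. It only shows the language-level behavior, not any benchmark, and the ** operator and Math.imul come from language versions newer than this thread.

```javascript
// Every JavaScript Number is an IEEE 754 double, so integers are exact
// only up to 2^53.
const limit = 2 ** 53;                       // 9007199254740992
const collides = limit + 1 === limit;        // true: 2^53 + 1 is not representable

// Bitwise operators coerce their operands to 32-bit integers, which is the
// usual way to get wrapping integer arithmetic (e.g. in hash or crypto code).
const wrapped = (0x7fffffff + 1) | 0;        // -2147483648
const product = Math.imul(0x10001, 0x10001); // 32-bit multiply: 131073
```

    Crypto code that wants 32-bit words therefore has to coerce after nearly every operation, which is part of why it benchmarks poorly against C.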

    • by Instine (963303) on Tuesday March 03, 2009 @04:42PM (#27056785)
      and you don't think you could get 100 times more users to visit your web app than you could convince to download and install an exe?
      • by psetzer (714543)

        Or they could use an Applet or JWS and get several times the performance for only a mild reduction in install base. JWS would even be able to run offline or when the browser window's closed and cache some output to a JVM-managed scratchpad file on disk.

      • Re: (Score:3, Interesting)

        It would need to be 10000x at the very minimum.

        If a user downloads, say, folding@home, it's running all day, every day, on all cores of the machine, whenever the computer is on and idle, which is most of the time. The user doesn't have to remember to run it, doesn't have to devote screen real estate, attention and so on, and the program is less annoying because of its low priority and relatively low memory footprint (less boxing).

        Additionally, the 40x I cited was in the fastest available browser (Chrome),

        • by fractoid (1076465)
          Why have you assumed the javascript user ran the site for 5 hours a day for a week, but that the installed .exe user ran it for a year? Even if one accepts your estimate of an installed .exe as being 400x faster than a javascript app, you should at least allow equal running time. And are you sure that modern browsers on multicore machines don't let multiple JS threads run on different cores?

          In which case, I would find it easy to believe that for every one slashdotter who would install a distributed comput
        • I think you are vastly underestimating the JIT engines of Chrome and FF. While these JIT engines still have a way to go, I would expect the execution speed of Javascript to approach the performance of other modern virtual machines like the JVM.
    • by Nebu (566313) <nebuNO@SPAMgta.igs.net> on Tuesday March 03, 2009 @05:25PM (#27057379) Homepage

      Javascript really isn't suited for this kind of thing, even with worker threads, for two reasons I can think of. First, web clients are transient... they'd have to report back often in case the user clicks away.

      I don't see why web clients being transient is a problem. The whole point of the MapReduce algorithm is that each worker (the web clients in this case) doesn't need to know anything about what the other workers are doing, what the system as a whole is doing, nor what it had done with any past job.

      • by sfcat (872532)

        I don't see why web clients being transient is a problem. The whole point of the MapReduce algorithm is that each worker (the web clients in this case) doesn't need to know anything about what the other workers are doing, what the system as a whole is doing, nor what it had done with any past job.

        Which is why Map-Reduce is only suitable for "easily" distributed problems. Lucky for Google that almost all their computational problems fit into this mold. But in the rest of the world, this just isn't the case. Which is why Map-Reduce is more interesting and trendy than a solid change in how distributed systems are designed.

    • Javascript just isn't a good language for massive computation. It only supports one kind of number (double), has no vectorization or multicore capabilities, has no unboxed arrays, and even for basically scalar code is some 40x slower than C, let alone optimized ASM compute kernels. (This is for crypto on Google Chrome. Other browsers are considerably slower on this benchmark. YMMV.)

      YMMV is true. I see speed differences of x5-x10 between -O3 C code and V8 - significant, but far from x40.

      As for having only doubles, that is true for the language, but not for engines, which can implement an integer type as well. This is a little tricky to do, but certainly possible: Anything that starts out as an integer will remain one over addition, subtraction and multiplication; you need to add checks for overflows and to handle division. In other words, developers have the convenience of only work

  • Link (Score:5, Informative)

    by Jamamala (983884) on Tuesday March 03, 2009 @04:06PM (#27056357)
    for those like myself that had no idea what MapReduce was:
    http://en.wikipedia.org/wiki/MapReduce [wikipedia.org]
  • by Estanislao Martínez (203477) on Tuesday March 03, 2009 @04:13PM (#27056429) Homepage

    Oh, please, make the MapReduce fanboyism stop.

    Yes, it's a neat technique. It's also very old and obvious. Google's implementation is also good, but this stuff is just not rocket surgery. It's just a simple pattern of how to massively parallelize some types of computational tasks.

    But somehow, just because some dudes at Google wrote a paper about it, it's become the second coming of Alan Turing or something among some silly folks. Hell, a couple of weeks ago somebody was saying on the comments here that MapReduce was a good alternative to relational databases. Now that is silly.

    • Re: (Score:3, Funny)

      by Anonymous Coward
      listen to him everyone, he must know what he's talking about since I don't know what rocket surgery is.
  • I stopped at... (Score:3, Insightful)

    by greymond (539980) on Tuesday March 03, 2009 @04:17PM (#27056481) Homepage Journal

    "Javascript... every browser can run it..."

    There is a huge difference between being able to run javascript apps and being able to run them well. Not to forget that a lot of the javascript I see out there really only works on PCs with IE or Firefox; Opera and Safari, especially on OS X, seem to have trouble with some sites that aren't coded for compatibility, but instead pushed out quickly with little regard for anything other than IE on Windows.

  • by clinko (232501) on Tuesday March 03, 2009 @04:20PM (#27056505) Homepage Journal

    A common mistake in multi-server builds is that bandwidth is free.

    Bandwidth Costs Money and Time. Both are reduced by having the network closer to the processing. This is one of the reasons google bought all that "dark fiber" left around after the .com bust.

    Another flaw is that it is difficult to get "good results" from computing on data in blocks unless they're doing relativity matrices (Think PageRank).

    Something to think about:
    If I'm sending names to your pc, what can I derive from that list without having the entire list?

    • Re: (Score:3, Interesting)

      by Nebu (566313)

      Something to think about: If I'm sending names to your pc, what can I derive from that list without having the entire list?

      Frequency of each name? Frequency of characters in names? Bayesian probability of one character following another in names? Number of names of a particular length?

      Each worker would compute the stats for their chunk of work (the "Map" part of MapReduce), and then send the results back to the server to be aggregated (the "Reduce" part of MapReduce).
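
      For the first of those (frequency of each name), a small sketch of the split between the two phases. The function names mapChunk and reduceCounts and the sample chunks are made up for illustration.

```javascript
// Map step: one worker counts the names in its own chunk only.
function mapChunk(names) {
  const counts = Object.create(null); // no inherited keys such as "constructor"
  for (const name of names) counts[name] = (counts[name] || 0) + 1;
  return counts;
}

// Reduce step: the server merges the partial counts from all workers.
function reduceCounts(partials) {
  const total = Object.create(null);
  for (const partial of partials) {
    for (const [name, n] of Object.entries(partial)) {
      total[name] = (total[name] || 0) + n;
    }
  }
  return total;
}

// Two hypothetical chunks, as if sent to two different browsers.
const chunks = [["ann", "bob", "ann"], ["bob", "cy"]];
const totals = reduceCounts(chunks.map(mapChunk));
console.log(totals);
```

      Neither worker sees the other's chunk, yet the merged totals are the same as counting the full list in one place.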

      Some of these may seem interesting, but then again, what interesting data can you derive at all from a list of names, even if you had the whole list?

  • Pay Me (Score:5, Interesting)

    by Doc Ruby (173196) on Tuesday March 03, 2009 @04:21PM (#27056529) Homepage Journal

    If there were a couple-few or more orgs competing to use my extra cycles, outbidding each other with money in my account buying my cycles, I might trust them to control those extra cycles. If they sold time on their distributed supercomputer, they'd have money to pay me.

    As a variation, I wouldn't be surprised to see Google distribute its own computing load onto the browsers creating that load.

    Though both models raise the question of how to protect that distributed computing from being attacked by someone hostile, poisoning the results to damage the central controller's value from it (or its core business).

  • by Anonymous Coward on Tuesday March 03, 2009 @04:23PM (#27056553)

    Is this why my browser keeps telling me scripts on the slashdot main page are taking too long and do I want to stop them for the last few months?

  • This seems to me a self-defeating idea. The obvious goal is to get more processing power. Yet using a scripted language is inefficient, and a waste of processing power. If you want more processing power, you need to group computers of the same general instruction set, and which can run compiled (or, dare I say it, assembled) machine code.
  • hijack-weasels who need to be shot have pretty much ruined the idea of distributed donated computing resources, thanks.

  • MapReduce is interesting because at Google and Hadoop it has a distributed filesystem underneath it too. The clever part is how data is distributed, and processing is moved to the data rather than moving data to the processing. I don't really see how this really helps matters, unless you are going to have data involved too - which brings in the privacy concerns yadda yadda yadda.

    Sure, some things would work that require huge amounts of processing power on limited data, but why would you use map-reduce for t

  • The increase in computing power caused by more users joining because it's so simple will be offset by the massive decrease associated with using Javascript rather than native code.
  • I'm sorry, but isn't this practically identical to the patent application to use javascript to treat browsers as distributed clients to perform a job like a distributed super computer? The patent application is at http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220020198932%22.PGNR.&OS=DN/20020198932&RS=DN/20020198932 [uspto.gov]

  • Remember folks, be sure to filter ALL user input!

  • We published a paper that did evolutionary algorithms in the browser some time ago: Browser-based distributed evolutionary computation: performance and scaling behavior [acm.org]. In the same conference, there was another paper: Unwitting distributed genetic programming via asynchronous JavaScript and XML [acm.org]

  • http://it.slashdot.org/article.pl?sid=03/12/31/2246241&tid=93

    MD5CRK used a Java applet that used this Chinese Lottery concept. The applet performed 95% as fast as a pure C implementation of MD5. JavaScript is another matter, however. And assembly code that interleaved MMX/SSE with ALU instructions was much faster.

    Background threads in browsers will help of course.
