Google File System Evolves, Hadoop To Follow 53
Christophe Bisciglia, Google's former infrastructure guru and current member of the Cloudera start-up team, has commented on Google's latest iteration of its GFS file system and deemed its features well within the evolutionary capabilities of open-source competitor Hadoop. "Details on Google's GFS2 are slim. After all, it's Google. But based on what he's read, Bisciglia calls the update 'the next logical iteration' of the original GFS, and he sees Hadoop eventually following in the (rather sketchy) footsteps left by his former employer. 'A lot of the things Google is talking about are very logical directions for Hadoop to go,' Bisciglia tells The Reg. 'One of the things I've been very happy to see repeatedly demonstrated is that Hadoop has been able to implement [new Google GFS and MapReduce] features in approximately the same order. This shows that the fundamentals of Hadoop are solid, that the fundamentals are based on the same principles that allowed Google's systems to scale over the years.'"
Hadoop (Score:5, Funny)
I wish they would stop taking names from Star Wars.
Re:Hadoop (Score:4, Funny)
These are not the names you are looking for.
Re: (Score:3, Interesting)
Speaking of names, check this out from TFA:
If they keep naming things with coffee references (including Java), what would happen if it's discovered that coffee causes cancer or shrunken balls or whatnot? It's already going to affect acceptance in Utah. This is why corporations pick bland, mean-nothing names like "Teamware" or "Altria" or "Inprise". I personally like
Re: (Score:3, Insightful)
Re:Hadoop (Score:4, Funny)
The day "caffeine" becomes a word that is objectionable to a non-trivial chunk of my customer base is the day I know the PC crazies have won.
Re:Hadoop (Score:5, Funny)
I personally switched to IIS to avoid offending my Native American brethren!
Re: (Score:3, Funny)
How is lighttpd offensive to Native Americans? :-)
Re: (Score:2)
Need mod points, need mod points quick! LMAO
Re: (Score:1)
at the expense of offending international astronauts.
Re: (Score:1)
The day "caffeine" becomes a word that is objectionable to a non-trivial chunk of my customer base is the day I know the PC crazies have won.
It's not just PCs! Have you never seen a Mac-head with a latte?
(I object to the term PC for a computer, it's mostly misleading)
Re: (Score:2)
It's not just PCs! Have you never seen a Mac-head with a latte?
(I object to the term PC for a computer, it's mostly misleading)
I object to the term "whoosh". I think it's insulting and Politically inCorrect.
Re: (Score:2)
I object to your face. I think it's ugly and smells like a butt.
Re: (Score:2)
Stuff 9? As in stuff 9 fingers? That's almost like fisting. Pervert!
Re: (Score:1)
Coffee causes/cures cancer (Score:2, Interesting)
If they keep naming things with coffee references (including Java), what would happen if it's discovered that coffee causes cancer or shrunken balls or what not?
Don't have to wait - in the UK one of the more egregious papers regularly publishes scare stories about things that cause or cure cancer. So much so that there are sites dedicated to Daily Mail Oncology Ontology.
Curiously coffee falls into both the good and bad camps [tumblr.com].
actually it's not that curious - never let consistency spoil a good rant
Re:Hadoop (Score:5, Informative)
Re: (Score:3, Insightful)
Score: 1, Informative
WTF?
Re: (Score:2)
Re: (Score:2)
I'm actually hearing the Street Fighter 2 announcer yelling "HADOOPKEN! HADOOPKEN!" in my head.
Wrong Link in the Summary? (Score:5, Interesting)
FAT (Score:1)
Kill M$, take the fat.
"open-source competitor Hadoop" (Score:1)
Re: (Score:2)
Because GFS is the foundation for all Google apps, and is why they end up scaling so well.
Re:"open-source competitor Hadoop" (Score:5, Funny)
It WAS meant to be only internally by Google, but then they accidentally the whole thing.
Re: (Score:1)
Happy?
Re:So it looks like these are for "cloud computing (Score:5, Informative)
If you want to be buzz-word compliant, then yes, kind of.
More to the point, GFS and HDFS are distributed file-systems that are designed to run on potentially very large clusters of commodity hardware. The potential applications are quite diverse. Hadoop itself involves more than just the file-system, but HDFS is really at the core of any application you would want to build with it. This list [apache.org] gives you a good idea of who uses Hadoop and for what purpose.
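To make that concrete, here's a toy Python sketch (all names invented, nothing like real HDFS internals) of the core idea behind GFS/HDFS: split files into large fixed-size blocks and replicate each block across several datanodes so no single machine loss drops data:

```python
# Toy sketch (NOT real HDFS code): how a GFS/HDFS-style file system
# splits a file into fixed-size blocks and replicates each block
# across several "datanodes". All names here are made up.

import itertools

BLOCK_SIZE = 4          # real systems use 64 MB+ blocks
REPLICATION = 3

def place_blocks(data: bytes, nodes: list) -> dict:
    """Map each block index to the nodes holding a replica."""
    placement = {}
    rotation = itertools.cycle(range(len(nodes)))
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = i // BLOCK_SIZE
        start = next(rotation)
        # pick REPLICATION distinct nodes, wrapping around the node list
        placement[block_id] = [nodes[(start + k) % len(nodes)]
                               for k in range(REPLICATION)]
    return placement

layout = place_blocks(b"hello world!", ["n1", "n2", "n3", "n4"])
print(layout)
```

The framework layers job scheduling on top of exactly this kind of block map: computation gets shipped to whichever nodes already hold the blocks.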
Re: (Score:1)
"If you want to be buzz-word compliant, then yes, kind of."
I see you're trying to be pedantic douchebag compliant.
Congratulations, you succeeded.
it's alive! (Score:2)
deemed its features well within the evolutionary capabilities of open-source competitor Hadoop.
I didn't know that file systems were living beings that could evolve. I thought they were inanimate and were designed by humans? Should I be afraid? Is it sentient yet?
Re: (Score:1)
Google File System was created by Man (Score:2)
It Rebelled.
It Evolved.
There are many Copies.
And it has a Plan.
Re: (Score:2)
Unfortunate for Hadoop (Score:5, Interesting)
I've been in the market for a distributed, clustered file system for some time. Unfortunately, Hadoop is not really what I'm looking for. What I'm looking for:
1) Redundancy - no single point of failure.
2) Suitable for standard-sized file I/O.
3) Performance that doesn't completely suck ass.
4) Graceful re-integration when bringing a cluster portion back online.
5) Accessible through standard interfaces. (e.g. a POSIX filesystem)
6) Doesn't require a PhD in the technology to administer.
7) Doesn't require insane quantities of cash to build.
8) Stable.
There are clustered file systems that have some of these qualities. None that I've found so far have *all* of these qualities.
Hadoop fails on #1, #2, and #6. It has a single NameNode commanding the cluster, so if that goes down, well... (shrug) It also does poorly with "normal" sized files, since apparently 10 GB files are the norm at Google. And setting up a multi-node cluster is definitely non-trivial.
Of all that I've reviewed, GlusterFS did the best [gluster.com] but even in that case, I ran into severe over-serialization that brought my 6-node cluster to its knees. I tried three times to roll it out, and had to roll back all three times. I fiddled with the brick setup and caches for days before finally throwing in the towel.
Now I get by with rsyncing program files and a homegrown data distribution setup using network sockets and xinetd. Not optimal to be sure, but so far it's scaled linearly and provides decent performance, at the price of a PhD in said technology. I guess you could compare our technology to MogileFS [danga.com], only our scheme
A) uses DNS records to coordinate the cluster so that it scales up,
B) has a richer "where is the file" schema than the simple flat keys used by Mogile, and
C) has the ability to execute programs against files for performance. (e.g. grep for searching text files, tar/gzip for compress/uncompress, virus scans, etc.)
D) has the ability to "hang open" for activities like logging.
So far, this has held up well with about 500,000 file operations and millions of log entries per business day with an average file size of about 1-3 megabytes and every sign that growth can continue by simply stacking on more hardware. No, I'm not talking about massive throughput, but I *am* talking about the need for high availability systems that scale nicely without bottlenecks and exorbitant expense. Yes, it works pretty well, but we've had to invest significant programming time to do this.
Guess it's like the old engineering saw: Convenient, Cheap, Quality: pick any two!
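For the curious, the DNS-coordinated "where is the file" lookup from (A) and (B) could be sketched roughly like this — a hypothetical illustration with invented hostnames and a dict standing in for the resolver, not our production code:

```python
# Hypothetical sketch of a DNS-coordinated file lookup.
# A real deployment would publish actual DNS A records; here a
# dict stands in for the resolver, and every name is invented.

import hashlib

# Stand-in for DNS: each partition subdomain maps to >= 2 data-store IPs.
FAKE_DNS = {
    "p0.files.example.com": ["10.0.0.1", "10.0.0.2"],
    "p1.files.example.com": ["10.0.1.1", "10.0.1.2"],
}

def partition_for(path: str, partitions: int = 2) -> str:
    """Hash a file path to its partition subdomain."""
    h = int(hashlib.sha1(path.encode()).hexdigest(), 16)
    return f"p{h % partitions}.files.example.com"

def hosts_for(path: str) -> list:
    """'Where is the file': resolve the partition to its replica hosts."""
    return FAKE_DNS[partition_for(path)]

print(hosts_for("/logs/2009-08-12/app.log"))
```

Because the hosts come out of DNS, adding capacity is just publishing more records — clients never need a central coordinator to find a file.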
Re:Unfortunate for Hadoop (Score:4, Interesting)
Most of what we do is web-based, so we took a hint from GlusterFS and moved the decisional logic to the client. We host the client, so we can assume a trustworthy client. This makes debugging easy, since all we have to do is echo stuff and see it in the browser.
Data stores work something like Gluster 'bricks' - they serve only as a data store, nothing more. You can think of a data store as a WebDAV server. Each partition is served by multiple data stores. To keep things simple, data stores trust requests and so 'auto-configure' based on the request.
We divide our data into partitions that correspond to DNS subdomains. Then we use DNS to publish partition data. We publish a minimum of two hosts (IP addresses) for each subdomain. All writes are made to all hosts by opening multiple sockets. Reads are served from the first 'best' host after reading header data.
In the case where any host doesn't have matching 'best' data on a read, the socket direction is reversed and the stale host is rewritten with the data read from the best host. This gives us auto-healing as needed. The only sticky point is delete, which we solve by treating a delete operation as successful only when all applicable data stores report success.
While implementation details are thorny and expensive, this is a system that should scale to any conceivable size, since we can partition across as many data stores as there is IP space for. And, by dividing our cloud so that data stores are grouped alongside the clients' hosting, we should see near-perfect linear scalability.
Works well so far, but it took over a year of experience to get it all working right, though we certainly weren't working on it exclusively.
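The write-everywhere / read-best / auto-heal logic above boils down to something like this in-memory sketch (invented names throughout; dicts and version numbers stand in for sockets and header data):

```python
# In-memory sketch of write-everywhere / read-best / auto-heal.
# Real data stores would be remote servers; a version number
# stands in for the header data used to pick the 'best' replica.

class Store:
    def __init__(self):
        self.files = {}          # path -> (version, data)

def write_all(stores, path, version, data):
    """Writes go to every replica."""
    for s in stores:
        s.files[path] = (version, data)

def read_best(stores, path):
    """Read headers from all replicas, serve the newest, heal the stale."""
    versions = [s.files.get(path, (0, b"")) for s in stores]
    best_version, best_data = max(versions)
    for s in stores:
        if s.files.get(path, (0, b""))[0] < best_version:
            s.files[path] = (best_version, best_data)   # auto-heal
    return best_data

stores = [Store(), Store(), Store()]
write_all(stores, "/f", 1, b"v1")
stores[2].files["/f"] = (0, b"stale")   # simulate a lagging replica
print(read_best(stores, "/f"))          # serves b'v1' and heals stores[2]
```

The delete corner case mentioned above is exactly what this sketch glosses over: a healed replica can't distinguish "never written" from "deleted" without a tombstone of some kind.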
What's your project like?
Re: (Score:3, Insightful)
Hadoop is not really a file system, or rather, as you found out, it doesn't make a good one. It's a framework for doing a certain type of parallel computing (MapReduce) on very large amounts of data. There's a filesystem (HDFS) in there, but it's pretty much designed for running such parallel jobs rather than serving as a clustered NAS. The filesystem is in some ways even irrelevant, since there's actually support for various backends (Amazon S3, etc.).
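If you've never seen map-reduce, the model collapses to something like this in plain Python (the real framework distributes the map and reduce steps across a cluster and reads its input from HDFS; the word-count job here is just the canonical teaching example):

```python
# Word count, the canonical map-reduce example, collapsed into
# single-process Python. Hadoop runs the same two phases across
# many machines, shuffling the intermediate pairs between them.

from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) for every word -- the 'map' step."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Sum the counts per key -- the 'reduce' step."""
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)

lines = ["big data big cluster", "big cluster"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(intermediate))   # {'big': 3, 'data': 1, 'cluster': 2}
```

The point is that neither phase cares where the data lives, which is why the filesystem underneath is swappable.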
Re: (Score:2)
Reading comprehension, apparently you should learn some. See those three words "in some ways"? Yeah, they matter a lot.
I said HDFS is irrelevant to Hadoop as in it's not a vital part of it, or as in it's not required, because it can be replaced, and quite often is.
Re: (Score:2)
Is there any chance your project would be released?
As you found out, there are only a couple of Linux clustering filesystems, all with drawbacks. It would be interesting to have a new one designed from the start around reliability.
Dont you know there's no evolution... (Score:2)
...it is being intelligently designed.