How To Build a 100,000-Port Ethernet Switch
BobB-nw writes "University of California at San Diego researchers Tuesday are presenting a paper (PDF) describing software that they say could make data center networks massively scalable. The researchers say their PortLand software will enable Layer 2 data center network fabrics scalable to 100,000 ports and beyond; they have a prototype running at the school's Department of Computer Science and Engineering's Jacobs School of Engineering. 'With PortLand, we came up with a set of algorithms and protocols that combine the best of layer 2 and layer 3 network fabrics,' said Amin Vahdat, a computer science professor at UC San Diego. 'Today, the largest data centers contain over 100,000 servers. Ideally, we would like to have the flexibility to run any application on any server while minimizing the amount of required network configuration and state... We are working toward a network that administrators can think of as one massive 100,000-port switch seamlessly serving over one million virtual endpoints.'"
Comment removed (Score:5, Insightful)
Re: (Score:2, Funny)
Re:Cable management... (Score:5, Funny)
No, it's wireless silly billy!
Good god, that means it's as reliable as my sex life. Like with REAL people, rather than me just ummm... actually... no, that's fine. Nothing to see here, move along, move along.
Re: (Score:3, Funny)
Hey, based on my sex life that means it's super reliable!
Reliably broken is still reliable, right?
Re: (Score:3, Funny)
Would that be 0% uptime?
Re:Cable management... (Score:4, Funny)
Good god, that means it's as reliable as my sex life. Like with REAL people, rather than me just ummm... actually... no, that's fine. Nothing to see here, move along, move along.
What kind of uptime are you getting?
Re:Cable management... (Score:4, Insightful)
Optical Switching with "no latency" via 10Gb/sec Multimode fiber up to 2 kilometers.
http://en.wikipedia.org/wiki/Optical_switching [wikipedia.org]
http://en.wikipedia.org/wiki/Multimode_fiber [wikipedia.org]
Low heat, low power, can use cheaper diode lasers, and no EMI or RFI issues on the fiber.
I was hoping they would have 100Gb/sec working, but it appears it is still in the works.
It can be done in a MUX'd fashion, using DWDM to send multiple frequencies of light down the same line, on up to 160 channels on the same strand of single-mode fiber, last I checked.
At least that is what Wiki says.
http://en.wikipedia.org/wiki/DWDM#Dense_WDM [wikipedia.org]
Re: (Score:2)
I was hoping they would have 100Gb/sec working, but it appears it is still in the works.
Easy! Lay 10 10Gb/sec cables next to each other!
Or hire bicycle messengers and give them hard drives. For 2 km at 20 km/h, each trip takes 360 seconds, so you would need to carry 36 terabits per trip, which is about five 1TB hard drives. You can give each cyclist one or give them all to one cyclist, depending on what your latency requirements are.
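A quick sanity check of the courier arithmetic, as a rough sketch: at 20 km/h the 2 km trip takes 360 s, so matching 100 Gb/s means carrying 36 terabits per trip. That is 4.5 terabytes, or five 1 TB drives if capacities are read in bytes (36 drives would only be needed if each held one terabit). The constant and function names below are made up for illustration.

```python
import math

# Figures taken from the comment above; names are illustrative only.
DISTANCE_KM = 2.0      # length of the run
SPEED_KMH = 20.0       # cyclist speed
TARGET_GBPS = 100.0    # link rate to match, in gigabits per second
DRIVE_TB = 1.0         # capacity of one drive, in decimal terabytes


def courier_payload(distance_km, speed_kmh, target_gbps):
    """Return (trip_seconds, terabits, terabytes) per trip to sustain target_gbps."""
    trip_seconds = distance_km * 3600.0 / speed_kmh
    terabits = target_gbps * trip_seconds / 1000.0   # Gb -> Tb
    terabytes = terabits / 8.0                       # bits -> bytes
    return trip_seconds, terabits, terabytes


trip_s, tbit, tbyte = courier_payload(DISTANCE_KM, SPEED_KMH, TARGET_GBPS)
drives = math.ceil(tbyte / DRIVE_TB)
print(f"{trip_s:.0f} s per trip, {tbit:.1f} Tb = {tbyte:.1f} TB, {drives} drives")
```

Bandwidth stays the same whether one cyclist carries all the drives or several carry one each; only the batching changes.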
Re: (Score:2)
Re: (Score:2)
Oh no... (Score:5, Funny)
I have nightmarish pictures popping into my head of a waterfall of ethernet cables spewing from this with user's ports un-numbered with no network diagrams. People bashing on the server room door in a zombie like state muttering "MRRRHH FACEBOOK!" "TWWIIIITEEEuggggghh" with me inside screeching "NO! NO! I DONT KNOW WHAT PORT YOUR DESK IS! NO! I CAN'T MAKE THINGS GO FASTER!" before curling up in a ball listening to the hum of servers and the lamentations of the users outside the door desperately scratching to get in.
Re: (Score:2)
I have nightmarish pictures popping into my head of a waterfall of ethernet cables spewing from this with user's ports un-numbered with no network diagrams.
I think this scenario is precisely why BOFHs have PFYs.
Re: (Score:1, Funny)
Re: (Score:2, Funny)
But machetes don't run out of ammunition.
Re: (Score:2)
Actually, my machete's all out of crowbars, thanks. Where'd you get an unlimited supply for yours?
Re:Oh no... (Score:4, Insightful)
What a party it would be for people who like to cause broadcast storms!
Just purge the ARP cache frequently and you will have a lot of broadcasts that can clog up the network.
Re: (Score:2)
Re: (Score:3, Informative)
If you have the tools it's possible to crimp one plug to both ends of a loop of wire, so that the port's own send and receive lines are joined. This confuses a router even more than a loop between two ports.
Don't Know What Port (Score:2)
NO! NO! I DONT KNOW WHAT PORT YOUR DESK IS! NO!
That's funny. Because right now I'm doing consulting work for a major bank. They know what port I'm on all the time. In fact, they have software that monitors my traffic and immediately cuts it off if something they don't like happens.
I just bring in my Macbook with an EVDO dongle if I want to surf.
Re: (Score:2)
It's probably just a software firewall that blocks your MAC instead of your ethernet port.
Re: (Score:2)
I have nightmarish pictures popping into my head of a waterfall of ethernet cables spewing from this with user's ports un-numbered with no network diagrams.
Whoa....someone needs a vacation.
Re: (Score:2)
Re: (Score:2)
Oedipus: Give to Oedipus! Give to Oedipus! Hey Josephus!
Josephus: Hey mother-fucker!
Watch out for loose cables! (Score:4, Funny)
I would seriously hate to be the guy that tripped over that power cable.
On the plus side it would be interesting to time how long it took for the DC's phone lines to melt.
-Matt
(redundant, redundant power. I know, I know)
Re:Watch out for loose cables! (Score:5, Funny)
I would seriously hate to be the guy that tripped over that power cable.
A sentry gun will be installed in the power cable corridor, to execute you the precise moment you've done your tripping. So you wouldn't have time to hate being yourself.
(redundant, redundant power. I know, I know)
To answer your worried look: yes, there's a redundant sentry gun for the other cable too.
Re: (Score:2)
How about installing those guns in a way that vaporizes you right *before* you would trip over that cable? Seems to make more sense to me...
Re: (Score:2)
Were you going to install sentry guns above the drop ceiling also?
"5 meters, man. 4, what the hell?"
Re: (Score:2)
Not redundant power. Power over Ethernet! Why should you be able to distinguish between the power cord and the data carrying cables?
Or possibly remote microwave power. So intense that interrupting the beam will destroy anything in the way. No need for machine guns, significant savings on ammo, reduced cremation costs. Win win all around.
You still need isolation (Score:5, Insightful)
I've long been of the opinion that putting more than a few hundred hosts on a single layer 2 network is almost always a bad idea.
What do you do about broadcast storms? How do you prevent some clown from anywhere in that 100,000 machine cloud from poaching another machine's IP address (either maliciously or by an accidental typo)?
Subnets and routers were invented for a reason. Just because you can bridge the whole world together into one massive virtual Ethernet segment doesn't mean you should.
Re: (Score:3, Funny)
Re: (Score:3, Funny)
And you could label the hubs with cheeky names like Wilma, Andrew, Ivan, and Camille.
Re:You still need isolation (Score:5, Informative)
What do you do about broadcast storms?
In the paper they detail how they handle ARP. All other broadcasts you can get away with dropping these days; use multicast instead. (Yes, that will break NETBIOS broadcast name lookups. So sad.)
How do you prevent some clown from anywhere in that 100,000 machine cloud from poaching another machine's IP address (either maliciously or by an accidental typo)?
That is a solved problem if you use decent switches. You can apply pretty much any policy you like.
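On the ARP point earlier in this comment, the general idea of answering ARP from a directory instead of flooding can be sketched roughly as follows. This is only an illustration, not the paper's implementation: `ArpDirectory` and `EdgeSwitch` are hypothetical names, and the fall-back-to-broadcast behaviour on a miss is an assumption.

```python
class ArpDirectory:
    """Hypothetical central IP -> MAC directory (stands in for a directory service)."""

    def __init__(self):
        self._table = {}

    def register(self, ip, mac):
        self._table[ip] = mac

    def resolve(self, ip):
        # Returns None when the mapping is unknown.
        return self._table.get(ip)


class EdgeSwitch:
    """Hypothetical edge switch that intercepts ARP requests from its hosts."""

    def __init__(self, directory):
        self.directory = directory
        self.broadcasts_sent = 0

    def handle_arp_request(self, target_ip):
        mac = self.directory.resolve(target_ip)
        if mac is not None:
            # Answered by a unicast lookup; nothing is flooded.
            return mac
        # Assumption: on a miss, fall back to an ordinary broadcast.
        self.broadcasts_sent += 1
        return None


directory = ArpDirectory()
directory.register("10.0.0.5", "00:02:01:03:00:05")
switch = EdgeSwitch(directory)
print(switch.handle_arp_request("10.0.0.5"), switch.broadcasts_sent)
```

With a populated directory, the common case costs one lookup per request rather than a frame delivered to every host on the segment.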
Re: (Score:3, Informative)
A no-broadcast policy breaks Wake-on-LAN.
Re: (Score:2)
WOL is not all that useful in a heavily-virtualized data center. Besides, if you have 100,000 hosts in one network, it's probably a bad idea to run an unauthenticated protocol like WOL.
You can achieve the same by ssh'ing to the management port and turning the server on anyway.
Re: (Score:2)
DHCP relay. Turns the broadcasts into unicast. Not much of a challenge, that one.
Re: (Score:2)
and then it becomes... say it with me, a router
Nope. Access controls are no more a router feature than they are a switch feature. They're just a feature that decent networking equipment has, no matter which layer it is operating on.
Re: (Score:3, Funny)
Quote fail. Sorry.
Re: (Score:2)
Re: (Score:2)
Yeah but, with all those nodes you could form a Beowulf cluster. Just think for a moment about all those sockets!
Re: (Score:2)
Having done this a while, I've found that large, flat networks actually work quite well. People often bring up all kinds of fears based on folklore from the unswitched hub days, and IMO they just don't apply any more on modern layer 3 switches.
What do you do about broadcast storms?
ACL broadcast default-deny. Broadcast generally isn't needed any more. ARP is proxied by the switch. NetBios broadcast resolution has no place on a large network. Virtually all other niches for broadcast are superseded by multicast these days. If you ever find s
It's all about address management (Score:5, Informative)
The paper is about adding a layer of addressing so that IP and Ethernet addresses can be moved from one machine to another as instances of virtual machines are migrated around. It's not about the problems of physically building a very large switch. The switch components are mostly stock items.
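To make the extra addressing layer concrete, here is a minimal sketch of packing a location into a 48-bit pseudo MAC. The field layout (16-bit pod, 8-bit position, 8-bit port, 16-bit VM id) is an assumption for this example and should be checked against the paper; `PMAC`, `encode_pmac` and `decode_pmac` are hypothetical names.

```python
from typing import NamedTuple


class PMAC(NamedTuple):
    pod: int       # assumed 16 bits: which pod the edge switch is in
    position: int  # assumed 8 bits: switch position within the pod
    port: int      # assumed 8 bits: switch port the host hangs off
    vmid: int      # assumed 16 bits: virtual machine behind that port


def encode_pmac(p: PMAC) -> str:
    """Pack the location fields into a colon-separated 48-bit address."""
    if not (0 <= p.pod < 1 << 16 and 0 <= p.position < 1 << 8
            and 0 <= p.port < 1 << 8 and 0 <= p.vmid < 1 << 16):
        raise ValueError("field out of range")
    value = (p.pod << 32) | (p.position << 24) | (p.port << 16) | p.vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))


def decode_pmac(mac: str) -> PMAC:
    """Recover the location fields from a colon-separated address."""
    value = int.from_bytes(bytes(int(x, 16) for x in mac.split(":")), "big")
    return PMAC(value >> 32, (value >> 24) & 0xFF, (value >> 16) & 0xFF,
                value & 0xFFFF)


addr = encode_pmac(PMAC(pod=2, position=1, port=3, vmid=5))
print(addr, decode_pmac(addr))
```

Because the address encodes where a host sits, a switch can forward on a prefix of it instead of keeping one table entry per host. When a VM migrates, it would get a new pseudo address while keeping its real one.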
Re: (Score:1)
That PMAC idea is really cool. But beyond that, nothing special. Try to build something larger and you will find that your core-layer switches do not have enough ports as the number of aggregation-level switches increases. And I am not even mentioning the throughput problems when distant nodes start communicating with each other.
To me it looks like they are trying to make routers redundant. But building a 100,000-node network with this topology will require really powerful core-layer nodes.
For large datacent
How big is that.....and when it fails... (Score:5, Funny)
Have fun replacing it when it fails. In my head I imagine something like this [dia.org].
Re:How big is that.....and when it fails... (Score:5, Funny)
...and every couple of months the mess of cables will have to be prodded with a broomstick to check for dead network engineers.
Re: (Score:2)
I gotta read comments more carefully. I thought you said "prodded with a boomstick to check for undead network engineers."
Come to think of it, that's probably a good idea too. Kind of a layer 3 LART.
The best bits of layer 2 and 3 eh? (Score:1)
How many LEDs is that? How much power in LEDs? (Score:5, Funny)
Let's see... That's 100,000 ports with 2 LEDs each (link, action/fdx/speed/poe) for a total of 200,000 LEDs. Let's say they use some of the cheapest SMD LEDs on the market. We'll use Digi-Key part number 160-1183-1-ND, which is a cheap 0603-footprint green LED. At quantity 200,000 that comes out to $12,000 in cut-tape packaging or $9,450 if you buy 210,000 of them in 3,000-qty reels.
Let's say that all of the link LEDs are on 100% of the time and the activity LED is on 50% of the time. That gives us 150,000 LEDs on at any given point in time. Our example LEDs use 20 mA at 2.1 V. So 150,000 LEDs at 20 mA draw 3 kA. In total, 6.3 kW is burned by the green LEDs.
All that blinking... Damn. I want one NOW!!! More than a girlfriend!
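For anyone who wants to replay the LED budget with different duty cycles, a small sketch follows. It uses the figures above (20 mA at 2.1 V, link LEDs always on, activity LEDs on half the time) and ignores resistor and driver losses. Multiplied out, 3 kA at 2.1 V is about 6.3 kW. The function and constant names are invented for the example.

```python
PORTS = 100_000
LINK_DUTY = 1.0        # link LED lit all the time
ACTIVITY_DUTY = 0.5    # activity LED lit half the time
CURRENT_A = 0.020      # 20 mA per lit LED
FORWARD_V = 2.1        # forward voltage per LED


def led_budget(ports, link_duty, activity_duty, current_a, forward_v):
    """Return (average LEDs lit, total current in amps, total power in watts)."""
    leds_on = ports * link_duty + ports * activity_duty
    total_current = leds_on * current_a
    total_power = total_current * forward_v
    return leds_on, total_current, total_power


leds_on, amps, watts = led_budget(PORTS, LINK_DUTY, ACTIVITY_DUTY,
                                  CURRENT_A, FORWARD_V)
print(f"{leds_on:,.0f} LEDs lit, {amps / 1000:.1f} kA, {watts / 1000:.1f} kW")
```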
Re: (Score:2)
That was pretty much my first thought when I saw the headline, too. I could never, ever manage to use something like this, but I totally want it!
I don't know what I'd do with it. Probably just put a pillow on it and sleep on it just to be close to that much technology.
Re: (Score:2)
And you don't factor in LED duty cycle or voltage drop.
Re: (Score:3, Interesting)
So true. We installed an 8 port IP-KVM switch in a rack recently, and the on light was _bright_ blue, to the point that 20m away it felt like it was boring a hole in my head. I cut some paper into ~1cm square pieces and taped a stack of 3 over it, and it still looked excessively bright. I don't know what the designers were thinking.
Re: (Score:2, Informative)
LEDs Magazine ??? (Score:3, Funny)
"Welcome to LEDs Magazine, the leading global information source for the LED community."
Wow, just wow !
Re: (Score:2)
I find that a layer or two of red 3M vinyl electrical tape works wonders at calming down blue LEDs, while still maintaining their general utility.
Green tape works pretty good on overly-bright red LEDs. And so on.
You mean (Score:5, Insightful)
I can't just go out and buy 33,334 d-links and turn off DHCP on all but one of them?
Re:You mean (Score:5, Funny)
Their next project is a 33,334-outlet power strip capable of holding that many wall warts without either crashing through the floor or shearing off the faculty wall.
Re:You mean (Score:4, Insightful)
Re: (Score:2)
Define "standard router": Home-grade equipment.
Wow (Score:1)
Re: (Score:2)
Not to mention a shitload of crossover cables to link the damn switches together.
Rehashing of long-abandoned ideas (Score:5, Insightful)
Without getting too far into it, their brilliant plan is to insinuate a layer 2 and a half using "pseudo MAC addresses," using a directory service rather than broadcasts. They're hoping they can use this mess to paper over horrific network design.
Yeah, I'll grant you you might be able to cobble this mess together in an academic setting, and sure, you'll even be able to rig some demos that show miraculous increases in speed.
I can guarantee they'll find funding with their promise that you'll be able to hire even LESS skilled network admins, meaning Zaboomafoo the Typing Lemur now has a shot at his CCIE.
But, damn, you ignorant twits. Most corporate networks are already mashed together by the most cut-rate cable monkeys they can find. The last thing we need is some half-assed "protocol" that will guarantee even more network designs that are guaranteed to trip and break their necks over the first packet.
Re: (Score:2)
Comment removed (Score:5, Interesting)
Re:Rehashing of long-abandoned ideas (Score:4, Funny)
And the programmers are the worst - every one of them thinks that being able to write software makes them qualified to administrate a nation-wide network, especially because they have a network at home, you see, and also do computer work for their friends and family.
Re: (Score:2)
I'd take the job if I could either get in writing that I get to replace anything that offends me, or if I were going hungry. Sounds like the cheapest and easiest option would have been to just replace the lot. It makes the most sense as a contract job though, the last part of the contract is helping to hire the admin who will work there.
Re:Rehashing of long-abandoned ideas (Score:4, Interesting)
That sounds like a law office I spec'd a job for. The law office manager knew me from her previous place where I was the "IT" guy. So this law office is having ALL sorts of network, computer and server problems, and asks me for a bid to fix it.
I scope the joint, prepare a bid, and I figure (using numbers from memory) it was $25,000 for everything installed, set up and running: new HW, server, computers and wiring (small office). EVERYTHING was BRAND NEW.
Their existing guy (I won't even call him IT) under bid me by $10K. They asked me to requote, and I told them no thanks. Obviously I didn't get the job.
Well, a few months later they called me back to try to fix what was done by this other guy. I look, and his wiring was flat phone cable (cat 2???) stapled to the wall in pretty "rows". Recycled home grown computers and I didn't even bother to look at the "server". I was too afraid.
I said to the Manager, "Network is flaky and nothing works right, huh?". Anyway, they ask me to requote them, and I hand them a copy of my original quote for $25,000 and say "here".
About this time, I notice all the file cabinets are covered in blue tarps, and see the roof is leaking from the rain. The office manager tells me that they do this every winter when it rains. I ask why they don't get it fixed.
"Because when it is raining, they can't fix it, and when it isn't raining, it isn't a problem".
The funny thing is, they spent the $15K of the original quote the guy quoted, and another $20K in service fees to the same guy trying to fix the new system he just put in ... in A FEW MONTHS!!!
I came to the conclusion that many lawyers aren't that bright. They pinch pennies while pissing away C notes.
I have no idea if that law firm's network ever ran right. The office manager quit shortly afterwards.
SO, it doesn't surprise me that what you saw was in a law office.
Re: (Score:2)
Re: (Score:2)
I could tell you story after story about lawyers. Their logic sucks. I think it is a requirement of the legal system that logic need not apply to anything.
Why that is so hard for law firms to understand (Score:2)
Why that is so hard for law firms to understand I'll never know.
Because in the field of Law and Business, when someone says it's so, that makes it so. The judge says he's guilty, so he's guilty. A senior partner says you're wrong, so you're wrong. The highest authority in the room are twelve people who have no idea what's going on, and the highest authority in the land are nine people who can't tell you the price of milk.
People who eat, sleep and breathe in that atmosphere become extremely disconnected from reality. They tend to take it personally when someone tells the
It's absolutely a bad thing (Score:3, Insightful)
Wizards, scripts, GUIs and "automagic" are awesome tools. I love my OSPF. I love my Spanning Tree. I love my VTP. I love my Auto speed and duplex settings. I love every tool that helps me take care of tedium and drudgery.
But before you hand these tools to a network designer, they absolutely need to understand HOW and WHY those tools do what they do, lest your network ends up looking like it was built by Mickey the Wizard's Apprentice. Powerful tools require MORE skill on the part of the network admin, not l
Re: (Score:2)
You're right, you can USE a vacuum cleaner (Score:2)
in ignorance, but you better not try to design one that way. If your job is vacuum cleaner DESIGN, it would really help if you knew how to wind that motor, and more importantly, WHY that motor was wound that way. But certainly, if you're the janitor, then feel free to push that handle back and forth, serene in the knowledge that someone else has done the heavy lifting for you.
I'm talking about the damage this idea would do to network design when Billy-the-uberl33t-LAN-Party-Badass tries to recable his Daddy
Cletus? (Score:2)
Cletus, is that you? How 'bout you, Billy and Zaboomafoo just move on out of the server room before you break something again, OK? I'd really like to sleep a whole eight hours tonight without getting yet another panicked phone call...
Failed the CCNA five times straight, did we? (Score:2)
Look, you're whining about the complexity of network design without understanding WHY there's complexity. We didn't break up layer 2 and layer 3 for the simple fun of it. We did it because we HAD to. There's a long reason and history for each piece of modern networking -- yes, even a kludge as ugly as NAT -- but you're not bothering to even try to wrap your head around any of it. For someone who whines about "world-proofing children" in your sig, you've undertaken precious little of it yourself. You sound j
It's like talking to a rock (Score:3, Insightful)
Yes, I'm going on and on trying to explain the technical side of it to you, but it's starting to feel a little like trying to explain math to a dog.
You're complaining about network complexity when you have no clue about WHY it's complex. You're asking that building networks be "easier," but you have no clue what you even mean by that.
So please, if you're not able to talk to the grownups about the real issues, step away from the keyboard. You're worse than the idiots showing up locked and loaded at the local
Have you even read the proposal?! (Score:3, Informative)
They're not reducing complexity. They're proposing sandwiching another layer between two and three. It's not going to make things easier to design and troubleshoot. It's going to end up causing more trouble than it's worth. The only people who like this idea are salesguys like you who will have a new buzzword to sell.
But hey, by all means, implement this scheme. You're going to end up needing twice the network engineers you do now. The network explosions it will cause will be epic, the stuff of legend like
Re: (Score:2)
Sounds like the skepticism about Ethernet from the Token Ring fans in the day. How could you possibly get any communication done with packets colliding all the time? As for the random backoff, how can adding randomness make a network MORE reliable?!?
It doesn't sound like their objective has anything to do with allowing trained lemurs to do networking (I thought they were already handing out CCIEs to lemurs). It also doesn't sound like speed is the intent other than allowing larger scale layer 2 switching wi
This seems to be a solution to a nonexistent probl (Score:5, Insightful)
This seems to be a solution to a nonexistent problem. A big router, for example a Cisco CRS, can be a single node supporting any data center. And it is a router, so there is no need for any exotic solution (L3 inspection on a switch?). It has a max bandwidth of 80Tb/s, or the equivalent of 80,000 Gigabit Ethernet ports. The beauty is of course that you can configure your entire data center with a single router, which greatly simplifies the network configuration, and makes changes simple.
Re: (Score:2)
It is certainly *not* multiple routers connected together with fabric racks. First, it is configured as a single router, and appears as a single router in the network topology. Second, the bandwidth behind each 40Gb/s card is about 200Gb/s, so that the entire box behaves as a single nonblocking router.
Imagine the size of WALLWART on that thing (Score:2, Funny)
I wonder if D-Link has any?
(swoooosh)
I read that as "Walmart" (Score:2)
bad quote: "Imagine the size of the Walmart needed to hold that thing!"
SMB (Score:2, Funny)
And then... let's say 10% of all computers start up an SMB share... welcome to broadcast heaven (or hell) :)
NATting layer two. (Score:5, Interesting)
They're basically NATting the layer two protocols. Combined with a super spanning tree for the natted addresses they're practically boosting layer two into layer three.
Before I read the paper I was thinking that it would be easier to just run all your services NATted at layer three, even using something like PPPoE (which is how cable networks solve the same basic problem, with something like half a million end-points on the same subnet). I guess it's more efficient to work with the simpler layer two protocols instead.
Idiots - if they had used 10base2 ... (Score:5, Insightful)
... they would have only needed 1 port! :)
Excellent idea! (Score:2, Insightful)
...and when this switch blows the fuses, you have 100,000 servers offline instead of 24... Brilliant!
VLANs (Score:2)
To everybody that says NO! to this (Score:2)
The past does not equal the future. Hardware improves, software improves.
Just because you were taught from birth that you should have thirty-five 100 port switches in your building and that is what you have always done does not mean you should continue to do it. Network engineers seem to LOVE buying lots of hardware (when given the money). Maybe it's just the cool factor, maybe they want job security? It WOULD be far easier to manage a single switched fabric flat network if you have the hardware and the fai
Single point of failure (Score:2)
Read Dr. Vahdat's blog post (Score:2, Insightful)
I regularly read Dr. Vahdat's blog [wordpress.com]. I first got interested in it after reading his paper on Epidemic Routing [ucsd.edu] which can be found in his list of publications here [ucsd.edu].
If you read his blog post you will see that he accomplishes his goal by creating a hierarchical tree of MAC addresses instead of a simple table. He also states that a large part of the proliferation of MAC addresses in these systems is due to virtual machines. Therefore everyone's nightmares of cabling hell are relatively moot.
Though I haven't cont
Re: (Score:3, Interesting)
Take great care not to use any MAC addresses that are already in use. One would probably need to purchase/register entire blocks of MAC addresses just as a manufacturer of network adapters must do. Or...
Or simply use the private/local range of MAC addresses (02:xx:xx:xx:xx:xx), the MAC address equivalent of, say, 10/8?
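For reference, what marks an address as locally administered is the second-least-significant bit of the first octet (0x02), so 02:xx:... is one such prefix among several (06:..., 0a:..., and so on also qualify). The least-significant bit (0x01) marks multicast. A small sketch of the check, with helper names invented for this example:

```python
def parse_mac(mac: str) -> bytes:
    """Parse a colon-separated MAC address into 6 raw bytes."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 octets, got {len(parts)}")
    return bytes(int(p, 16) for p in parts)


def is_locally_administered(mac: str) -> bool:
    """True if the U/L bit (0x02 in the first octet) is set."""
    return bool(parse_mac(mac)[0] & 0x02)


def is_multicast(mac: str) -> bool:
    """True if the I/G bit (0x01 in the first octet) is set."""
    return bool(parse_mac(mac)[0] & 0x01)


for example in ("02:00:00:00:00:01", "00:1b:21:00:00:01", "01:00:5e:00:00:01"):
    print(example, is_locally_administered(example), is_multicast(example))
```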
Re: (Score:2)
According to wireshark some of those are reserved to actual hardware vendors.
Re: (Score:2)
According to wireshark some of those are reserved to actual hardware vendors.
Assuming that those aren't specifically cited as locally administered addresses, I'm sure there are some duplicates in there as well, something else vendors shouldn't be doing. OUI's shouldn't really be starting with 02.
http://en.wikipedia.org/wiki/MAC_address#Address_details [wikipedia.org]
How long till (Score:2)
How many IT staff would go mad in the sea of network wires?
At the point of 100,000+ ports I would rather invest heavily in research to make a wireless switch that can handle 100,000+ connections at Gigabit speeds (and of course a corresponding wireless device interface for each rack).
100,000 eggs, 1 basket. (Score:2)
Lemme see - 100,000 eggs, one basket.
Good idea.
MAC-ADDRESS-TABLE (Score:2)
I can't wait until it reaches the limit on the MACS it can learn and just starts forwarding. :-)
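What happens at that limit can be shown with a toy learning switch: once the table is full, new source addresses are not learned, so frames addressed to them are flooded out of every other port. This is a simplified sketch with invented names. Real switches also age entries out and vary in how they handle a full table.

```python
class LearningSwitch:
    """Toy layer 2 learning switch with a bounded MAC table."""

    def __init__(self, num_ports, table_limit):
        self.num_ports = num_ports
        self.table_limit = table_limit
        self.table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src, dst):
        """Process one frame and return the list of ports it is sent out of."""
        # Learn (or refresh) the source only if it is known or there is room.
        if src in self.table or len(self.table) < self.table_limit:
            self.table[src] = in_port
        out_port = self.table.get(dst)
        if out_port is not None:
            return [out_port]
        # Unknown destination: flood to every port except the ingress port.
        return [p for p in range(self.num_ports) if p != in_port]


sw = LearningSwitch(num_ports=4, table_limit=2)
print(sw.receive(0, "a", "b"))  # "b" unknown yet, so the frame is flooded
print(sw.receive(1, "b", "a"))  # "a" was learned on port 0
print(sw.receive(2, "c", "a"))  # table is full, "c" is not learned
print(sw.receive(0, "a", "c"))  # so traffic to "c" is flooded
```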
ring topologies (Score:2)
Re: (Score:2)
Can one say "single point of failure"?
Re: (Score:2)
...that the answer involves duct tape.
Pfft... you only need duct tape if you want it to look pretty. Otherwise there's nothing stopping you from piggybacking 16,666 of these together [tp-link.com].
I can get them for $13.99 each, bringing the whole thing to just $233,158! That's excluding the cost of connecting wire, of course. Lots and lots and lots of wire...
Does anyone know what the arrangements are to receive my consultant's fee for that answer?
Re: (Score:2)
Re: (Score:2)
Just offer a 'buy 16,666, get one free!' Works every time.
Re: (Score:2)
~Sticky