Boosting Socket Performance on Linux
Cop writes "The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out of your application and to tune the GNU/Linux® environment to achieve the best results."
Be aware (Score:4, Funny)
Re:Be aware (Score:4, Insightful)
Exactly... especially with things like these, it's usually best for the entire internet if you just stick with the defaults... they are defaults for a reason; it might not be the best for you, but it's most likely the best for the internet as a whole.
Reminds me of those people tweaking firefox settings to hammer all kinds of webservers... sure, your browsing might be slightly faster, at the expense of the browsing of lots of other people...
Re:Be aware (Score:3, Interesting)
Re:Be aware (Score:2, Interesting)
Re:Be aware (Score:2)
I'm glad I'm not the only one who remembers this. We used to call it, "Netrape", because of this behavior.
I still kind of miss Mosaic.
Re:Be aware (Score:1)
Re:Be aware (Score:1)
I think my point is: what might seem like network abuse today is likely to be SOP in a few years' time.
Re:Be aware (Score:3, Interesting)
are you sure?
From a paper written by Phil Dykstra, back in 1999.
"A recent example comes from the Pacific Northwest Gigapop in Seattle which is based on a collection of Foundry gigabit ethernet switches. At Supercomputing '99, Microsoft and NCSA demonstrated HDTV over TCP at over 1.2 Gbps from Redmond to Portland. In order to achieve that performance they used 9000 byte packets and thus had to bypass the switches at the NAP! Let's hope that in the future NAPs don't place 1500
Re:Be aware (Score:2)
Re:Be aware (Score:4, Informative)
slashdotted? (Score:5, Funny)
Re:slashdotted? (Score:2)
Judging by the response time from IBM's web server, it looks like they have yet to put their advice into practice.
... or too many Slashdot visitors already did that exact thing... :-)
Re:slashdotted? (Score:2)
Boost socket performance on Linux
Four ways to speed up your network applications
M. Tim Jones (mtj@mtjones.com), Senior Principal Software Engineer, Emulex
17 Jan 2006
The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high perf
Re:somewhat old... (Score:2)
I agree though, nothing earth shaking. Nagle's algorithm is discussed in depth in most TCP/IP books, and so is how to turn it off. Wake me up when they post something new.
GNU/Linux®... A less efficient way to say Linux (Score:2, Funny)
It's "GNU divided by Linux" (Score:2)
The GNU community has been divided by Linux into two subsets: dogmatic types who work on HURD and the pragmatic majority who just use what they call "GNU/Linux". Therefore, "GNU/Linux" (also spelt "GNU÷Linux") is completely accurate :-)
Re:GNU/Linux®... A less efficient way to say Lin (Score:2)
Hello 1995 (Score:5, Insightful)
Re:Hello 1995 (Score:1)
One of the great things about computers is they allow different implementations of the same idea. Because of this, someone who knows how to tune the networking on one OS may not know how to on Linux. Also, not everyone has been programming since 1995. Do you also complain when the weather report comes on the local news, because you've seen a weather report before?
Re:Hello 1995 (Score:3, Insightful)
Now if only the article actually covered something specific to Linux, I'd agree with you. About the most useful thing it does is tell you the location of the same parameters that you muck with on every other system in existence. This info has only been around for Linux for, oh, more than a decade. Pick up any bo
Re:Hello 1995 (Score:4, Interesting)
Along the same lines: where is the discussion of the different FD polling mechanisms? select() versus poll(), and where's the write-up on Linux's epoll()? I would have been interested in an epoll() article, especially how it compares to FreeBSD's kqueue().
Re:Hello 1995 (Score:5, Informative)
For the overview, you want Dan Kegel's c10k page:
http://www.kegel.com/c10k.html [kegel.com]
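For a taste of the epoll() interface itself, here's a minimal sketch (a rough illustration only; error handling trimmed, and listen_fd is assumed to be a socket that is already bound and listening):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void event_loop(int listen_fd)
{
    int epfd = epoll_create(64);                 /* size hint, mostly ignored on 2.6 */
    struct epoll_event ev, events[64];

    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   /* register interest once */

    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);     /* only ready fds come back */
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                int client = accept(listen_fd, NULL, NULL);
                ev.events = EPOLLIN;
                ev.data.fd = client;
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);
            } else {
                char buf[4096];
                ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
                if (r <= 0) {                         /* peer closed or error: drop it */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, events[i].data.fd, &ev);
                    close(events[i].data.fd);
                }
                /* ...otherwise handle the r bytes in buf... */
            }
        }
    }
}

Unlike select()/poll(), the kernel remembers the interest set, so each wakeup costs roughly in proportion to the number of ready descriptors rather than the total number being watched.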
Hello 2003. (Score:5, Interesting)
Documentation like this is great and extremely valuable. It would be much more valuable, however, if it remained current. For example, can the ABISS [sourceforge.net] project (which improves block I/O) be used at all? What do the numbers look like, when using profiling tools like Web100 [web100.org] (which profiles TCP communications)?
Has anyone run the Linux or one of the *BSD kernels through DAKOTA [sandia.gov], KOJAK [fz-juelich.de] or PAPI [utk.edu] to determine where, precisely, bottlenecks are within the kernels? It's easy to theorise, but isn't it cleaner to measure?
Now, I'm not saying these things aren't being done. They probably are, somewhere, by someone, but if the results aren't getting published we don't really know what impact what changes are going to have. The current method of evolving Operating System code in general is often a mix of personal theory and subjective experience based on non-random samples of activity. That can't really be a good way to do things, can it?
If I'm wrong, feel free to say. If I'm right, then maybe it would be a good thing if someone (possibly me) put together some kind of testing kit for measuring Linux kernel performance and actually measured the stats for Linux kernels on some kind of regular basis.
Re:Hello 1995 (Score:2)
Vuja de rules!
Re:Hello 1995 (Score:2)
Actually, given that it's 2006, I would have thought that the socket layer would be smart enough to perform these sorts of "optimizations" for you automatically, by analyzing your usage patterns. There's no reason the programmer should have to deal with any of this crap, except maybe by providing a broad hint such as "Maximize throughput" or "Minimize latency."
Re:Hello 1995 (Score:2)
To a certain degree, they are optimized. Since most network activity occurs through a higher level networking API (e.g. HTTP), the network performance is already optimized by the library. It's not all that often that you have to open a direct socket unless you happen to be writing such a library or server.
Which just further points out how much this a
Re:Hello 1995 (Score:2)
BDP = link_bandwidth * RTT
100 Mbps * 0.050 sec / 8 = 0.625 MB = 625 KB
Note: I divide by 8 to convert from bits to bytes communicated.
So, set your TCP window to the BDP, or 1.25MB. Where does
Is this a typo, or am I missing something in the calculation?
Re:Hello 1995 (Score:2)
Re:Hello 1995 (Score:2)
What do you mean "if"?
TWW
Re:Hello 1995 (Score:4, Informative)
From the MAN page [linuxmanpages.com]:
The article could have better explained that in context. For the most part it's automatic though, so don't worry about it.
Re:Hello 1995 (Score:2)
The article could have better explained that in context. For the most part it's automatic though, so don't worry about it.
Thanks, that is the answer. Hopefully others will see it.
Re:Hello 1995 (Score:2)
Re:Hello 1995 (Score:2)
Because most of us know more than you think you do?
What is an article about an API designed in 1983, in a language dating back to 1972, supposed to look like?
Old.
Barring that, definitely not "News for Nerds" or "Stuff that Matters".
And I doubt the poster actually read it considering it describes features specific to Linux 2.6 (e.g. I don't think 2.4 actually supported setting SO_{SND,RCV}BUF).
You do realize that SO_SNDBUF and SO_RCVBUF are part of the POSIX standard [jaluna.com],
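For reference, setting them by hand is one setsockopt() call per direction. A minimal sketch (the 625 KB figure is just the 100 Mbps x 50 ms BDP example quoted earlier in the thread, and /proc/sys/net/core/wmem_max and rmem_max may cap what you actually get):

#include <sys/socket.h>

/* Size both socket buffers to roughly one bandwidth-delay product. */
int size_buffers_to_bdp(int sock)
{
    int bdp = 625 * 1024;

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp)) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp)) < 0)
        return -1;
    return 0;
}

Note that on 2.6 kernels the receive buffer is autotuned anyway, so an explicit SO_RCVBUF mostly matters when you want to pin or cap the size yourself.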
Re:Hello 1995 (Score:2)
Yeah? So does this mean you think Linux is POSIX compliant? If so, then maybe you should spend more time coding than posting drivel on
Re:Hello 1995 (Score:2)
For the most part? Yes. It's not fully POSIX compliant, but close enough. Patches exist in the wild that make it 100% POSIX. Reaching a compliant state has actually been a pretty big deal for the Linux community.
If so, then maybe you should spend more time coding than posting drivel on
I'm sorry, is your point that SO_[SND|RCV]BUF wasn't in 2.4? Or 2.2? Because (as we can see from this pretty manpage [ed.ac.uk] for Linux 2.0) it was. So there's no reason t
Re:Hello 1995 (Score:1)
Summary ripped directly from article (again) (Score:2, Informative)
Here is the summary:
The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out of your application and to tune the GNU/Linux® environment to achieve the best results.
Here is the first paragraph of the article:
The Sockets
Re:Summary ripped directly from article (again) (Score:1, Offtopic)
Re:Summary ripped directly from article (again) (Score:2)
-l
Re:Summary ripped directly from article (again) (Score:2)
No wonder he was un
No mention of alternatives to select? (Score:5, Informative)
Code Portability (Score:1)
It's funny you should mention this. I was thinking of the class libraries or frameworks, if you will, included with Java, MFC (if it's still used these days), Visual Age, and so on. Does this mean, and are you saying, that the only way to get the best performance from TCP/IP is to roll your own
Re:Code Portability (Score:5, Informative)
Re:Code Portability (Score:2)
depends on how it's used (Score:2)
Re:depends on how it's used (Score:2)
True enough. But how many applications nowadays are written expecting no more than ~32 simultaneous connections?
Re:depends on how it's used (Score:2)
The argument isn't that select and poll don't work with large numbers of sockets; the problem is that they aren't scalable. Both system calls take, as an argument, a list of file descriptors the program is interested in listening to. If there are hundreds (or thousands) of sockets open, those lists can become unwieldy. The scalable alternatives remember which file descriptors the process has asked to be notifie
Re:No mention of alternatives to select? (Score:2)
Ok, the issue is how many fds you can pass. With select() you are limited to a bitmap's worth. And performance has never been much of an issue.
Of course, poll() is a different matter -- if you are passing 100s or thousands of fds.
But, what has this got to do with the tcp connection? Not much.
So, you speed up poll() and still write small packets, and Nagle won't write them out immediately... That's about the only connection here.
Ratboy.
Re:No mention of alternatives to select? (Score:2)
Agreed. But the topic of his article is "Boost socket performance on Linux," not "How to optimize TCP layer use on Linux." And the article deals almost entirely with API-level settings. It just seems odd to me that he'd not mention issues with some of the classic BSD functions.
Re:No mention of alternatives to select? (Score:3, Informative)
http://www.kegel.com/c10k.html [kegel.com]
Very awesome paper. How do _you_ make a server that handles 10,000 connections?
--jeff++
Re:No mention of alternatives to select? (Score:5, Informative)
http://www.xmailserver.org/linux-patches/nio-impr
The website is hideous, but there used to be benchmarks of the different polling/selecting methods. If I remember correctly, it's kind of trial-and-error, YMMV stuff. It's worth a look.
Nothing new (Score:1, Funny)
Re: (Score:2)
Re:GNU/Linux®? (Score:3, Informative)
Re:GNU/Linux®? (Score:2)
Duh?
Re:GNU/Linux®? (Score:1)
IBM is getting some good Linux content... (Score:5, Interesting)
Folks who are thinking about writing something technical - give dW a shot. The editors are savvy folks and there's lots of good stuff up there already.
Oh, and book plug [pmdapplied.com]!
I've always wanted to know if it is possible (Score:2)
Re:I've always wanted to know if it is possible (Score:1)
Re:I've always wanted to know if it is possible (Score:2)
Re:I've always wanted to know if it is possible (Score:5, Funny)
It ignores you except at feeding time, and pees in your shoes when it's mad at you?
Re:I've always wanted to know if it is possible (Score:5, Informative)
Is that what you're looking for?
Re:I've always wanted to know if it is possible (Score:2)
Hmm I'll look into it. Thanks!
man nc (Score:1)
Nagle's algorithm (Score:5, Interesting)
To get around the above problems, I came up with the following scheme: Leave Nagle's algorithm enabled, but create a FlushSocket() function that merely disables Nagle on the socket, then calls send() on the socket with a 0-byte buffer, then enables Nagle again. This apparently forces the TCP stack to immediately send any data that it may have accumulated in its Nagle-buffer. Therefore the only thing the calling code has to remember to do is to call FlushSocket() whenever it has called send() one or more times and doesn't think it will be sending any more data any time soon.
The above technique seems to work pretty well under Linux, Windows, and OS/X (and is more portable than Linux-specific flags like TCP_CORK, etc), but I haven't seen it documented anywhere. Is that simply an oversight, or is there some nasty downside to this technique that I'm overlooking?
Re:Nagle's algorithm (Score:3, Interesting)
calls you have to pay for?
If you have some knowledge about the natural grouping of data, it would be better to just turn Nagle off and do buffering in user space (collect up enough data and send it all in one go).
Re:Nagle's algorithm (Score:1)
Most programmers do this "natural grouping" anyway and write the data to the socket in a single buffer only when they want it to be sent. The problem is that sometimes their grouping is not good enough and they perform multiple writes when they could perfor
Re:Nagle's algorithm (Score:1)
Not exactly; what you really need is a send() flag to tag data to be sent immediately. No need to slow down your application with another syscall.
willy
Re:Nagle's algorithm (Score:3, Informative)
Re:Nagle's algorithm (Score:1)
I tried this in the past and it was not that good because of the added syscalls. In a pure network application, syscalls are your worst enemy. And by avoiding this trick and carefully grouping your data into large writes, you both reduce the number
Re:Nagle's algorithm (Score:2)
At one point I submitted a patch that would add a TCP_FLUSH "option" that saved the TCP_CORK and TCP_NODELAY flag values, called the low-level flush routine, and then reestablished the flags.
It was rejected. (But I still use it from time to time on my own, love that Open Source. 8-)
Meanwhile, just drop and restore Nagle as fast as you can, it will save y
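For readers who haven't met TCP_CORK: the cork/uncork dance being saved and restored above looks roughly like this (a Linux-specific sketch; the uncork is what actually pushes any partial segment out):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Cork the socket, queue several small writes as one segment, then uncork. */
void send_response(int sock, const char *hdr, size_t hlen,
                   const char *body, size_t blen)
{
    int on = 1, off = 0;

    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));   /* hold partial frames */
    send(sock, hdr, hlen, 0);
    send(sock, body, blen, 0);
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off)); /* flush what's queued */
}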
Re:Nagle's algorithm (Score:2)
That's a good point -- the only reason the send() is in there was because otherwise the trick doesn't work under MacOS/X. I will #ifndef __linux__ that line in my code though.
flush( sd ) would be nice (Score:1)
Re:flush( sd ) would be nice (Score:2)
See my previous post above ("Nagle's Algorithm") for a way to do it.
Re:flush( sd ) would be nice (Score:1)
Re:flush( sd ) would be nice (Score:2)
I agree, that would be nice... good luck getting it into the POSIX standard anytime soon though. :^(
Also could you post the code of your Flush function? I find the description a little confusing at some points.
Sure, here is the code:
void FlushSocketOutput(int s)
{
    SetNaglesEnabled(s, false);   /* turn Nagle off so queued data goes out now */
    send(s, NULL, 0, 0);          /* zero-byte send nudges the stack to transmit */
    SetNaglesEnabled(s, true);    /* restore Nagle for subsequent writes */
}
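(For completeness, SetNaglesEnabled() is the poster's own helper and isn't shown; a plausible implementation is just a TCP_NODELAY toggle, something like:)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdbool.h>
#include <sys/socket.h>

void SetNaglesEnabled(int s, bool enabled)
{
    int nodelay = enabled ? 0 : 1;   /* TCP_NODELAY = 1 means Nagle is OFF */
    setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay));
}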
Re:flush( sd ) would be nice (Score:2)
- disable Nagle
- set blocking mode
- set the TCP buffer to 0 bytes
- write 0 bytes
- put things back the way they were
Recall that fflush() blocks until the data makes it to disk; I expect he'd want to block until the socket buffers were empt
Re:flush( sd ) would be nice (Score:3, Interesting)
I don't know if that really makes sense for networking though... the reason you'd want fflush() to block until the data makes it to disk is so that once your call to fflush() returns you know that your written data is safe in the event of a crash or power failure. (Although with too-clever hard drive firmware I'm not so sure even that's true anymore!). With networking on the ot
Re:flush( sd ) would be nice (Score:2)
Beats the hell out of me!
I've noticed that in the vast majority of instances (but not all) programmers looking for this type of solution are trying to apply a band-aid to a poor design anyhow. I try not to think too hard about poor designs.
Math error in paper? (Score:3, Informative)
throughput = window_size / RTT
110KB / 0.050 = 2.2MBps
If instead you use the window size calculated above, you get a whopping 31.25MBps, as shown here:
625KB / 0.050 = 31.25MBps
That's funny, I get 12.5MBps
???
Reason for the error (Score:1)
Looks like they were doing the calculations with a calculator and somebody pressed '*' instead of '/' !!!!
625 * 5 = 3125
A forgivable slip of the fingertip.
Re:Math error in paper? (Score:1)
Boost socket performance on Linux [ibm.com]
Tom Young
dW Linux editor
Socket tuning (Score:2)
Re:Socket tuning (Score:2)
I run a Linux firewall, so I put in a six layer quality-of-service set. I put "ver
Re:Socket tuning (Score:1)
All you need to do to get outstanding performance on an asymmetric line is the following:
[ON THE ROUTER]
1. Prioritize TCP ACK Packets as #1 to always go upstream first to your connection
2. Restrict upload rate by 2% - 5% of the actual upload rate of the connection
Do these three things, and enjoy a fast connection in both dire
GNU/Linux® (Score:1)
Re: GNU/Linux® (Score:2)
Always liked the Winsock Lame List (Score:2)
However, the Lame List [tangentsoft.net] contains a lot of wonderful nuggets.
I must disagree with the article, however: there are so, SO few times that disabling the Nagle algorithm is the correct answer that the standard answer when someone asks about it on the networking forums is that the asker doesn't understand Nagle, and to re-enable it. Telnet is even a bastard case in that your networking performance
The trouble with the Nagle algorithm (Score:5, Interesting)
Here's the real problem, and its solution.
The concept behind delayed ACKs is to bet, when receiving some data from the net, that the local application will send a reply very soon. So there's no need to send an ACK immediately; the ACK can be piggybacked on the next data going the other way. If that doesn't happen, after a 500ms delay, an ACK is sent anyway.
The concept behind the Nagle algorithm is that if the sender is doing very tiny writes (like single bytes, from Telnet), there's no reason to have more than one packet outstanding on the connection. This prevents slow links from choking with huge numbers of outstanding tinygrams.
Both are reasonable. But they interact badly in the case where an application does two or more small writes to a socket, then waits for a reply. (X-Windows is notorious for this.) When an application does that, the first write results in an immediate packet send. The second write is held up until the first is acknowledged. But because of the delayed ACK strategy, that acknowledgement is held up for 500ms. This adds 500ms of latency to the transaction, even on a LAN.
The real problem is that 500ms unconditional delay. (Why 500ms? That was a reasonable response time for a time-sharing system of the 1980s.) As mentioned above, delaying an ACK is a bet that the local application will reply to the data just received. Some apps, like character echo in Telnet servers, do respond every time. Others, like X-Windows "clients" (really servers, but X is backwards about this), only reply some of the time.
TCP has no strategy to decide whether it's winning or losing those bets. That's the real problem.
The right answer is that TCP should keep track of whether delayed ACKs are "winning" or "losing". A "win" is when, before the 500ms timer runs out, the application replies. Any needed ACK is then coalesced with the next outgoing data packet. A "lose" is when the 500ms timer runs out and the delayed ACK has to be sent anyway. There should be a counter in TCP, incremented on "wins", and reset to 0 on "loses". Only when the counter exceeds some number (5 or so), should ACKs be delayed. That would eliminate the problem automatically, and the need to turn the "Nagle algorithm" on and off.
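As a rough illustration of that win/lose counter (a sketch only, not from the comment above; names are made up and this is nothing like actual kernel code):

/* Per-connection bookkeeping for the "is delaying ACKs paying off?" bet. */
struct delack_state {
    int wins;   /* consecutive times a reply arrived before the timer fired */
};

/* Outgoing data let us piggyback the pending ACK: the bet paid off. */
void delack_won(struct delack_state *st)  { st->wins++; }

/* The delayed-ACK timer expired and a bare ACK had to be sent: the bet lost. */
void delack_lost(struct delack_state *st) { st->wins = 0; }

/* Delay the next ACK only if we've been winning the bet consistently. */
int should_delay_ack(const struct delack_state *st) { return st->wins > 5; }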
So that's the proper fix, at the TCP internals level. But I haven't done TCP internals in years, and really don't want to get back into that. If anyone is working on TCP internals for Linux today, I can be reached at the e-mail address above. This really should be fixed, since it's been annoying people for 20 years and it's not a tough thing to fix.
The user-level solution is to avoid write-write-read sequences on sockets. write-read-write-read is fine. write-write-write is fine. But write-write-read is a killer. So, if you can, buffer up your little writes to TCP and send them all at once. Using the standard UNIX I/O package and flushing writes before each read usually works.
John Nagle
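To make that user-level workaround concrete, here is one way to do it with stdio (an illustration, not Mr. Nagle's code; the protocol lines are made up): wrap the socket fd in a buffered FILE stream, let stdio coalesce the small writes, and fflush() before reading the reply.

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Build a request from several small pieces, send it as one write, read the reply. */
ssize_t request_reply(int sock, char *reply, size_t replylen)
{
    FILE *out = fdopen(dup(sock), "w");   /* dup() so fclose() won't close the socket */
    if (out == NULL)
        return -1;

    /* These small writes land in the stdio buffer, not on the wire... */
    fputs("HELO example\r\n", out);
    fputs("MORE stuff\r\n", out);

    /* ...and go out together here, avoiding the write-write-read trap. */
    fflush(out);
    fclose(out);

    return read(sock, reply, replylen);
}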
Re:The trouble with the Nagle algorithm (Score:2)
you can tell TCP that you are willing to accept a delay of d, with the default being the 500 ms previously used. Thus protocols like X could state that they don't need to hang waiting for an ACK, while programs that should hang waiting for an ACK will continue to do so.
This extension would only require recompiling the programs that want to deviate from the prior default delay behaviour, such as recompi
...what about UDP? (Score:1, Offtopic)
Pining for Doors (Score:2)
Linux is quite tragic that way. Hopefully there will be a Debian user-land on the OpenSolaris kernel soon, and then I can rock-n-roll again.
Re:Never trust an article with a (R) symbol... (Score:2)