Extreme Memory Oversubscription For VMs

Posted by Soulskill
from the eXtreme-virtualization dept.
Laxitive writes "Virtualization systems currently have a pretty easy time oversubscribing CPUs (running lots of VMs on a few CPUs), but have had a very hard time oversubscribing memory. GridCentric, a virtualization startup, just posted on their blog a video demoing the creation of 16 one-gigabyte desktop VMs (running X) on a computer with just 5 gigs of RAM. The blog post includes a good explanation of how this is accomplished, along with a description of how it's different from the major approaches being used today (memory ballooning, VMware's page sharing, etc.). Their method is based on a combination of lightweight VM cloning (sort of like fork() for VMs) and on-demand paging. Seems like the 'other half' of resource oversubscription for VMs might finally be here."
  • Given how many programs leak memory, it's amazing that companies get away with oversubscribing memory without running into big issues. And desktop programs are usually the worst of the bunch.

    • Leaky memory, you need NuIO memory stop leak, just pour it in and off you go.....NuIO a Microsoft Certified Product

    • Re:Leaky Fawcet (Score:5, Informative)

      by ls671 (1122017) * on Tuesday August 10, 2010 @11:35PM (#33212184) Homepage

      Memory leaks usually get swapped out... your swap usage will grow but the system will keep going just as fast since those pages will never get swapped in again. I have tried several times to explain that to some Slashdotters who bragged about not using any swap space nowadays and who called me stupid for reserving a swap partition of 2 gigs or more on a 4 gig RAM machine that sometimes runs for 2 years before rebooting.

      Oh well....

      • Re:Leaky Fawcet (Score:5, Interesting)

        by Mr Z (6791) on Tuesday August 10, 2010 @11:48PM (#33212232) Homepage Journal

        Sometimes that doesn't work out so well. If you have a fragmented heap with gaps between the leaked items that keep getting reused, it can lead to a lot of strange thrashing, since it effectively amplifies your working set size.

        I think that may be one of the things that was happening to older Firefoxes (2.x when viewing gmail, in particular)... not only did it leak memory, it leaked memory in a way such that the leak couldn't just stay in swap.

        • Sometimes that doesn't work out so well. If you have a fragmented heap with gaps between the leaked items that keep getting reused, it can lead to a lot of strange thrashing, since it effectively amplifies your working set size.

          I think that may be one of the things that was happening to older Firefoxes (2.x when viewing gmail, in particular)... not only did it leak memory, it leaked memory in a way such that the leak couldn't just stay in swap.

          Wouldn't that be a good exercise for kernels? Recording the usage patterns of memory subsections, defragmenting them into segments by usage frequency. If that is not possible at runtime, store and apply at the next run.

          Or maybe clustering chunks by the piece of code that allocated them would already help. That said, I don't know what malloc's current wisdom is.

          • by Mr Z (6791)

            The heap is entirely in userspace, and the kernel is powerless to do anything about it.

            Imagine some fun, idiotic code that allocated, say, 1 million 2048-byte records sequentially (2GB total), and then only freed the even-numbered records. (I'm oversimplifying a bit, but the principle holds.) Now you've leaked 1GB of memory, but it's spread over 2GB of address space.

            The kernel only works in 4K chunks when paging. Each 4K page, though, has 2K of leaked data and 2K of free space. For all the subsequent non-leak allocati
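A scaled-down sketch of the layout described above, in Python. The function name `leak_layout`, the record count, and the idea of modelling the heap as one flat run of fixed-size slots are illustrative assumptions; a real allocator is far messier than this.

```python
PAGE_SIZE = 4096    # page size quoted in the thread
RECORD_SIZE = 2048  # record size from the example above


def leak_layout(num_records, record_size=RECORD_SIZE, page_size=PAGE_SIZE):
    """Lay records out back to back, free the even-numbered ones, and report
    (leaked_bytes, pages_with_reusable_space, total_pages)."""
    leaked_bytes = 0
    pages_with_free_space = set()
    for i in range(num_records):
        start = i * record_size
        if i % 2 == 0:
            # Freed record: every page it overlaps now has reusable space.
            first_page = start // page_size
            last_page = (start + record_size - 1) // page_size
            pages_with_free_space.update(range(first_page, last_page + 1))
        else:
            # Odd-numbered record is never freed: this is the leak.
            leaked_bytes += record_size
    total_bytes = num_records * record_size
    total_pages = (total_bytes + page_size - 1) // page_size
    return leaked_bytes, len(pages_with_free_space), total_pages


leaked, reusable_pages, total_pages = leak_layout(1000)
print(leaked, reusable_pages, total_pages)
```

With 1000 records, half the bytes are leaked, yet every one of the 500 pages still holds reusable space. If later allocations land in those gaps, no page is purely leaked data that could sit in swap untouched.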

          • by ensignyu (417022)

            The kernel can only defragment pages, which are 4KB on most Linux systems. If you have a page with 4080 bytes of leaked memory and 16 bytes of memory that you actually use, accessing that memory will swap in the entire page.

            You can't move stuff around within a page because the address would change (moving pages is OK because all memory accesses go through the TLB [wikipedia.org]), unless you have a way of fixing up all the pointers to point at the new location. That's generally only possible in a type-safe language like Ja

      • Re: (Score:2, Informative)

        by sjames (1099)

        Personally, I like to make swap equal to the size of RAM for exactly that reason. It's not like a few Gig on a HD is a lot anymore.

        • Re:Leaky Fawcet (Score:5, Informative)

          by GooberToo (74388) on Wednesday August 11, 2010 @12:41AM (#33212436)

          Unfortunately you're not alone in doing this. It's a deprecated practice that used to make sense, but hasn't in a very long time.

          The problem stems when legitimate applications attempt to use that memory. How long does it take to page (read/write) 16GB, 4KB at a time? If a legitimate application that uses large amounts of memory runs away because of a bug, it can effectively bring your entire system to a halt, because it will take a long, long time before it runs out of memory.

          Excluding Windows boxes (they have their own unique paging, memory/file mapping, and backing store systems), swap larger than 1/4 to 1/2 of memory is generally a waste these days. As someone else pointed out, sure, you can buy more uptime from leaking applications, but frankly that's hardly realistic. Expecting to go a couple of years without needing a kernel update is just silly, unless you care more for uptime than you do for security and/or features and/or performance.

          The old 1:1+x and 2:1 memory to disk ratios are based on notions of swapping rather than paging (yes, those are two different virtual memory techniques), plus allowing room for kernel dumps, etc. Paging is far more efficient than swapping ever was. These days, if you ever come close to needing a 1:1, let alone 2:1, page file/partition, you're not even close to properly spec'ing your required memory. In other words, with few exceptions, if you have a page file/partition anywhere near that size, you didn't understand how the machine was to be used in the first place.

          You might come back and say, one day I might need it. Well, one day you can create a file (dd if=/dev/zero of=/pagefile bs=1024 count=xxxx), initialize it as page (mkswap /pagefile), and add it as a low priority paging device (swapon -p0 /pagefile). Problem solved. You may say the performance will be horrible with paging on top of a file system, but if you're overflowing several GB to a page file on top of a file system, the performance impact won't be noticeable, as you already have far, far greater performance problems. And if the page activity isn't noticeable, the fact that it's on a file system won't matter.

          Three decades ago it made sense. These days, it's just silly and begging for your system to one day grind to a halt.
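To put rough numbers on the "16GB, 4KB at a time" question, here is a back-of-the-envelope sketch in Python. Both disk rates are assumptions picked for illustration (about 100 scattered page transfers per second versus about 80MB/s of purely sequential transfer); neither is a measurement of any particular system.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
PAGE_SIZE = 4096


def page_count(total_bytes, page_size=PAGE_SIZE):
    """Number of whole pages needed to cover total_bytes."""
    return total_bytes // page_size


def seconds_random(total_bytes, pages_per_second, page_size=PAGE_SIZE):
    """Time if every page costs one scattered I/O operation."""
    return page_count(total_bytes, page_size) / pages_per_second


def seconds_sequential(total_bytes, bytes_per_second):
    """Time if the whole region streams to disk in one linear pass."""
    return total_bytes / bytes_per_second


pages = page_count(16 * GIB)
worst = seconds_random(16 * GIB, pages_per_second=100)  # assumed random-I/O rate
best = seconds_sequential(16 * GIB, 80 * MIB)           # assumed streaming rate
print(pages, round(worst / 3600, 2), round(best / 60, 2))
```

Under those assumed rates the same 16GB takes anywhere from a few minutes to more than eleven hours, which is why the access pattern matters more to this argument than the raw size.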

          • Re: (Score:1, Redundant)

            by GooberToo (74388)

            Why is a factually accurate, topical, informative, and polite message marked troll?

            • 1. Programs can't use the swapped memory directly. The kernel only swaps parts of memory that haven't been accessed in a while.

              2. By swapping out unused (even because of leaks) memory, the kernel has more memory to use for disk caching.

              All this has nothing to do with whether your system will grind to a halt today instead of one month later.

              And to answer your question, the mod(s) apparently thought this was common knowledge and not worth responding to.

              • by GooberToo (74388)

                Yes, everything you said is known and understood, but hardly topical.

                By paging leaked memory, if the leak is indeed bad enough to justify an abuse of the VM to offset it, chances are you'll be suffering from fragmentation and be on the negative side of the performance curve at some point. It's just silly to believe you'll be running a badly leaking application over the span of years and desire to hide the bug rather than fix it. There is just nothing about that strategy which makes sense.

                So to bring this ful

                • I apologise, I didn't pay enough attention to the context while replying.

                  Indeed, using swapping for the sole purpose of mitigating memory leaks is wrong.

          • Re:Leaky Fawcet (Score:5, Interesting)

            by sjames (1099) on Wednesday August 11, 2010 @01:33AM (#33212610) Homepage

            I often see uptimes measured in years. It's not at all unusual for a server to need no driver updates for its useful lifetime if you spec the hardware based on stable drivers being available. The software needs updates in that time, but not the drivers.

            In other cases, some of the drivers may need an update, but if they're modules and not for something you can't take offline (such as the disk the root filesystem is on), it's no problem to update.

            Note that I generally spec RAM so that zero swap is actually required if nothing leaks and no exceptional condition arises.

            When disks come in 2TB sizes and server boards have 6 SAS ports on them, why should I sweat 8 GB?

            Let's face it, if the swap space thrashes (yes, I know paging and swapping are distinct but it's still called swap space for hysterical raisins) it won't much matter if it is 1:1 or .5:1, performance will tank. However, if it's just leaked pages, it can be useful.

            For other situations, it makes even more sense. For example, in HPC, if you have a long running job and then a short but high priority job comes up, you can SIGSTOP the long job and let it page out. Then when the short run is over, SIGCONT it again. Yes, you can add a file at that point, but it's nice if it's already there, especially if a scheduler might make the decision to stop a process on demand. Of course, on other clusters (depending on requirements) I've configured with no swap at all.
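A minimal sketch of the stop/continue trick above, in Python on a POSIX system. The helper name `run_while_paused` is made up for illustration. Stopping a process does not by itself push its pages out; the kernel only does that if the short job creates memory pressure.

```python
import os
import signal


def run_while_paused(long_job_pid, short_job):
    """Stop the long-running process, run short_job(), then let it continue."""
    os.kill(long_job_pid, signal.SIGSTOP)      # long job stops being scheduled
    try:
        return short_job()
    finally:
        os.kill(long_job_pid, signal.SIGCONT)  # resume even if short_job raises
```

A scheduler would call something like `run_while_paused(pid, run_urgent_job)`, where `run_urgent_job` is whatever callable launches the high priority work. The `finally` clause is there so the long job is never left stopped by accident.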

            And since Linux can do crash dumps and can freeze into swap, it makes sense on laptops and desktops as well.

            Finally, it's useful for cases where you have RAID for availability, but don't need SO much availability that a reboot for a disk failure is a problem. In that case, best performance suggests 2 equal sized swaps on 2 drives. If one fails, you might need a reboot, but won't have to wait on a restore from backup and you'll still have enough swap.

            Pick your poison, either way there exists a failure case.

            And yes, in the old days I went with 2:1, but don't do that anymore because it really is excessive these days.

            • Re: (Score:3, Insightful)

              by GooberToo (74388)

              I often see uptimes measured in years. It's not at all unusual for a server to need no driver updates for its useful lifetime if you spec the hardware based on stable drivers being available. The software needs updates in that time, but not the drivers.

              Yes, we've all seen that. It makes for nice bragging rights. But realistically, to presume that one might have a badly leaking application, which can not ever be restarted, and that memory/paging fragmentation is not a consequence, to justify a poor practice is just that, a poor practice. And of course, that completely ignores the fact that there are likely nasty kernel bugs going unfixed. So it means you're advertising a poor practice, which will likely never be required, as an excuse to maintain uptime at

              • Re: (Score:1, Redundant)

                by somersault (912633)

                I'm assuming you've already heard of it, but you can use something like ksplice to patch up the kernel on the fly. It's not necessary to skip updates even if you want 100% uptime.

                • by GooberToo (74388)

                  Yes, I've read plenty on ksplice. Most distributions do not yet include it. Besides, even those that do include it are not exempt from reboot; not by a wide measure. Ksplice is only good for very minor updates which do not change any data structures. Once a structure changes, ksplice is no longer an option.

                  And assuming ksplice were always an option, it still doesn't change the fact that the justification for this entire thought experiment is as contrived and as poor a practice as they come.

            • Re:Leaky Fawcet (Score:5, Insightful)

              by vlm (69642) on Wednesday August 11, 2010 @06:04AM (#33213510)

              When disks come in 2TB sizes .... why should I sweat 8 GB?

              You are confusing capacity problems with throughput problems. Sweat how poor performance is when 8 gigs gets thrashing.

              The real problem is the ratio of memory access speed vs drive access speed has gotten dramatically worse over the past decades.

              Look at two scenarios with the same memory leak:

              With 8 gigs of glacially slow swap, true everything will keep running but performance will drop by a factor of perhaps 1000. The users will SCREAM. Which means your pager/cellphone will scream. Eventually you can log in, manually restart the processes, and the users will be happy, for a little while.

              With no/little swap, the OOM killer will reap your processes, which will be restarted automatically by your init scripts or equivalent. The users will notice that maybe, just maybe, they had to click refresh twice on a page. Or maybe it seemed slow for a moment before it was normal speed. They'll probably just blame the network guys.

              End result: swap means a long outage that needs a manual fix; no swap means no outage at all and an automatic fix.

              In the 80s, yes you sized your swap based on disk space. In the 10s (heck, in the 00s) you size your swap based on how long you're willing to wait.

              It takes a very atypical workload and very atypical hardware for users to tolerate the thrashing of gigs of swap...

              • With 8 gigs of glacially slow swap, true everything will keep running but performance will drop by a factor of perhaps 1000. The users will SCREAM. Which means your pager/cellphone will scream. Eventually you can log in, manually restart the processes, and the users will be happy, for a little while.

                Is there a modern OS with a VM manager that horrible? And while I agree that the ratio of memory speed to HDD speed (but not necessarily SSD speed) keeps growing in favor of RAM, the ratio of RAM size to hard drive throughput still seems about the same. For instance, my first 512KB Amiga 1000 had a 5KB/s floppy, so writing out the entire contents of RAM would take about 100 seconds. These days my home server has 8GB of RAM and each of its drives can sustain about 80MB/s throughput, so writing out the entire
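The two figures quoted above can be checked with a few lines of Python. Binary units (1KB = 1024 bytes) are an assumption here, and the calculation is a single linear write of all of RAM, which says nothing about scattered paging I/O.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def seconds_to_write_ram(ram_bytes, bytes_per_second):
    """Time to stream the full contents of RAM to disk in one linear pass."""
    return ram_bytes / bytes_per_second


amiga = seconds_to_write_ram(512 * KB, 5 * KB)   # 512KB of RAM, 5KB/s floppy
server = seconds_to_write_ram(8 * GB, 80 * MB)   # 8GB of RAM, 80MB/s drive
print(amiga, server)
```

Both cases come out at roughly 100 seconds, which matches the claim that the ratio of RAM size to sequential drive throughput has stayed about the same between those two machines.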

                • by GooberToo (74388)

                  Is there a modern OS with a VM manager that horrible?

                  You're mis-attributing the issue. The issue is not one of a poor VM. The issue is one of a poor admin. The VM is attempting to honor the configuration which the admin provided. By providing massive page area the admin has instructed the VM/paging system to suffer massive performance loss in exchange for not returning out of memory errors.

                  These days my home server has 8GB of RAM and each of its drives can sustain about 80MB/s throughput, so writing out the entire contents of RAM would take about... 100 seconds.

                  That's an extremely simplistic way of looking at it. Paging rarely happens entirely linearly; at least not because of memory pressures. Worse, paging is triggered because o

                  • Re: (Score:3, Informative)

                    by sjames (1099)

                    You are confusing the pageout of little used (or unused but unfreed) pages as a one time event with a constant state of thrashing. The former makes no noticeable difference to system performance except that it keeps NOT ending up out of memory, the latter is a horrific performance killer that happens any time the running processes actively demand more RAM than the machine has.

                    An admin who routinely allows the machines to thrash would indeed be bad. The solution to that is adding ram or moving some services

              • by sjames (1099)

                *IF* it was thrashing the swap, it would be a terrible problem, but it's not. A leaked page swaps out and NEVER swaps back in. It's a one time event for that page. Since it is leaked, nothing even "remembers" it exists except the swap.

                If I SIGSTOP a memory hog for a bit, there is a one time hit as it gets paged out. Then another as it is paged back in. Still no thrashing.

                As I said, I do make sure that there is enough RAM for the actual memory in use. No thrashing at all.

                • by GooberToo (74388)

                  A leaked page frequently causes fragmentation. Under memory pressure you've now directly inflicted additional I/O and a loss of contiguous memory, and you've imposed yet more paging pressure.

                  The saner solution is to simply, periodically, restart your application. Followed by, getting it updated as soon as a fix is available.

                  • by sjames (1099)

                    Agreed. However, since people don't always drop everything and devote their lives to fixing the bug when I file a report, it's useful to employ a workaround in the meantime. There's generally not much problem with a scheduled restart, but it's nice to know the system won't go down in flames should that be put off for a while.

                    • by GooberToo (74388)

                      "A while" was never the point of contention. The "I do x, y, and z" to address a problem which almost never happens, to avoid updating a buggy application (or simply periodically restarting it), such that one can maintain an uptime of "years", is the point of contention. It's such a point of contention, I'd call it "complete bullshit."

                    • by sjames (1099)

                      If it means that much to you, fine. I hereby declare that I shall never in your lifetime hold a gun to your head and order you to allocate swap 1:1 with RAM!

                      There, feel better now? YEEESH!

          • Re: (Score:3, Interesting)

            by akanouras (1431981)

            Excuse my nitpicking, your post sparked some new questions for me:

            The problem stems when legitimate applications attempt to use that memory. How long does it take to page (read/write) 16GB, 4KB at a time?

            Are you sure that it's only reading/writing 4KB at a time? It seems pretty braindead to me.

            The old 1:1+x and 2:1 memory to disk ratios are based on notions of swapping rather than paging (yes, those are two different virtual memory techniques), plus allowing room for kernel dumps, etc. Paging is far more efficient than swapping ever was.

            Could you elaborate on the difference between swapping and paging? I have always thought of it (adopting the term "paging") as an effort to disconnect modern Virtual Memory implementations from the awful VM performance of Windows 3.1/9x. Wikipedia [wikipedia.org] mentions them as interchangeable terms and other sources on the web seem to agree.

            You might come back and say, one day I might need it. Well, one day you can create a file (dd if=/dev/zero of=/pagefile bs=1024 count=xxxx), initialize it as page (mkswap /pagefile), and add it as a low priority paging device (swapon -p0 /pagefile). Problem solved.

            Just mentioning here th

            • Re: (Score:3, Informative)

              Could you elaborate on the difference between swapping and paging? I have always thought of it (adopting the term "paging") as an effort to disconnect modern Virtual Memory implementations from the awful VM performance of Windows 3.1/9x. Wikipedia mentions them as interchangeable terms and other sources on the web seem to agree.

              It's actually (buried) in the wikipedia article you link, but only a sentence or so. In the old days, before paging, a Unix system would swap an entire running program onto disk/drum. (That's where the sticky bit comes from, as swap space was typically much faster than other secondary storage, if nothing else the lack of a file system helps, the sticky bit on an executable file meant, "keep text of program on swap even when it's stopped executing". This meant that executing the program again would go much f

            • by GooberToo (74388)

              Someone else already provided an excellent summary. But in a nutshell, swapping moves an entire process at a time, while paging typically moves the least frequently used pages within that process. Swapping typically leads to linear I/O while paging typically does not.

              It's for historical reasons that paging and swapping are frequently intermixed - just as the two implementations frequently were. Just the same, the distinction is important to understand when it comes time to allocate a paging area.

              Quoting Andrew Morton:

              Good point - and some

          • The one good use for a 1:1 memory to disk ratio nowadays is suspend to disk. If you don't have enough swap space available and you try to suspend, it doesn't work.
        • by GooberToo (74388)

          Sorry. My other post, which provides lots of good, accurate information, was troll-moderated. It's been forever since I last saw meta-moderation actually fix a troll-moderated post, so I'm hoping others will fix it. Not to mention, it's information many, many users should learn.

          Hopefully you and others will read the post and realize why it's a bad idea, even though it's a popular notion.

      • by Osso (840513)

        Often you don't want to swap, and want memory allocations to fail. Sometimes everything will be slow and you can barely access your server instead of being able to check on what's happening.

      • I have lately disabled my swap for a very simple reason: with 4GB of RAM, the swap was only ever used when some rogue application suddenly went into an eat-all-memory loop *cough*adobe flash*cough*.

        It might not be a good reason in theory, but in practice I'd rather have the OOM killer kick in sooner than struggle with a system that is practically hung due to all the swapping. I can live with having slightly less of my filesystem cached.
      • by kenh (9056)

        First, it's Leaky Faucet (Unless you are thinking of Farrah Fawcett [tobyspinks.com] 8^)

        Second, never try to teach a pig to fly, it wastes your time and annoys the pig. The same goes for having in-depth technical discussions with many slashdot commenters...

      • 2 years so updates are way behind? Not all of them are no reboot ones.

      • by ultranova (717540)

        Memory leaks usually get swapped out... your swap usage will grow but the system will keep going just as fast since those pages will never get swapped in again.

        I once had explorer.exe on Windows 7 go into some kind of seizure where it ended up using over 2 gigs of memory (of a total 4) before I killed it. It certainly was swapping in and out constantly. Fun, that.

    • by druke (1576491)

      You make it sound like this is some sort of conspiracy. Generally when you'd want to do something like this you would be doing VM servers anyways. They didn't do much (anything, actually) in the way of 'desktop programs' beyond X...

      Why does this matter anyways, it's not the vm dev's job to fix memory leaks in openoffice. They have to go forward assuming everything is working correctly. Also, if they're all sharing the memory leak, it'd be optimized anyways :p

  • Kernel shared memory (Score:5, Informative)

    by Narkov (576249) on Tuesday August 10, 2010 @11:08PM (#33212078) Homepage

    The Linux kernel uses something called kernel shared memory (KSM) to achieve this with its virtualization technology. LWN has a great article on it:

    http://lwn.net/Articles/306704/ [lwn.net]

    • by amscanne (786530) on Wednesday August 11, 2010 @12:41AM (#33212434)

      Disclaimer: I wrote the blog post. I noticed the massive slashdot traffic, so I popped over. The article summary is not /entirely/ accurate, and doesn't really completely capture what we're trying to do with our software.

      Our mechanism for performing over-subscription is actually rather unique.

      Copper (the virtualization platform in the demo) is based on an open-source Xen-based virtualization technology named SnowFlock. Where KSM does post-processing on memory pages to share memory on a post-hoc basis, the SnowFlock method is much more similar to unix 'fork()' at the VM level.

      We actually clone a single VM into multiple ones by pausing the original VM, COW-ing its memory, and then spinning up multiple independent, divergent clones off of that memory snapshot.

      We combine this with a mechanism for bringing up lightweight VMs fetching remote memory on-demand, which allows us to bring up clones across a network about as quickly and easily as clones on the same machine. We can 'clone' a VM into 10 VMs spread across different hosts in a matter of seconds.

      So the mechanism for accomplishing this works as follows:
      1. The master VM is momentarily paused (a few milliseconds) and its memory is snapshotted.
      2. A memory server is set up to serve that snapshot up.
      3. 16 'lightweight' clone VMs are brought up with most of their memory "empty".
      4. The clones start pulling memory from the server on-demand.

      All of this takes a few seconds from start to finish, whether on the same machine or across the network.
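The four steps above can be sketched as a toy model in Python. This is not SnowFlock's or Copper's actual code: the class names `MemoryServer` and `CloneVM`, the dict-of-pages representation, and the `clone` helper are all assumptions made for illustration. For simplicity each clone keeps its own copy of every page it fetches, whereas clones on one host would want to share unmodified pages.

```python
class MemoryServer:
    """Serves pages from a frozen snapshot of the master VM (steps 1 and 2)."""

    def __init__(self, master_pages):
        self._snapshot = dict(master_pages)  # copy taken while the master is paused
        self.requests = 0

    def fetch(self, page_no):
        self.requests += 1
        return self._snapshot[page_no]


class CloneVM:
    """A lightweight clone that starts with no resident memory (step 3)."""

    def __init__(self, server):
        self._server = server
        self._pages = {}

    def read(self, page_no):
        # Step 4: pull the page from the server the first time it is touched.
        if page_no not in self._pages:
            self._pages[page_no] = self._server.fetch(page_no)
        return self._pages[page_no]

    def write(self, page_no, data):
        # Writes go to the clone's private copy; the snapshot is never modified.
        self._pages[page_no] = data

    def resident_pages(self):
        return len(self._pages)


def clone(master_pages, count):
    """Snapshot the master and bring up `count` empty clones backed by it."""
    server = MemoryServer(master_pages)
    return server, [CloneVM(server) for _ in range(count)]


master = {0: b"kernel", 1: b"libc", 2: b"heap"}
server, clones = clone(master, 16)
clones[0].read(1)              # clone 0 faults in one page on demand
clones[0].write(2, b"dirty")   # clone 0 diverges without touching the snapshot
```

In this model a clone costs nothing until it touches memory, and a write in one clone is invisible to the others, which is the fork()-like behaviour the post describes.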

      We're using all of this to build a bona-fide cluster operating system where you host virtual clusters which can dynamically grow and shrink on demand (in seconds, not minutes).

      The blog post was intended not as an ad, but rather a simple demo of what we're working on (memory over-subscription that leverages our unique cloning mechanism) and a general introduction to standard techniques for memory over-commit. The pointer to KSM is appreciated, I missed it in the post :)

      • by pookemon (909195) on Wednesday August 11, 2010 @01:45AM (#33212662) Homepage

        The article summary is not /entirely/ accurate

        That's surprising. That never happens on /.

      • by descubes (35093) on Wednesday August 11, 2010 @01:46AM (#33212666) Homepage

        Having written VM software myself (HP Integrity VM), I find this fascinating. Congratulations for a very interesting approach.

        That being said, I'm sort of curious how well that would work with any amount of I/O happening. If you have some DMA transfer in progress to one of the pages, you can't just snapshot the memory until the DMA completes, can you? Consider a disk transfer from a SAN. With high traffic, you may be talking about seconds, not milliseconds, no?

      • by kenh (9056)

        Let me see if I understand - you take one VM (Say, an Ubuntu 10.04 server running a LAMP stack, just to pick one), then you make "diff's" of that initial VM and create additional VMs that are also running OS/software (as a starting point), Of course, I can load up other software on the "diff'd" VMs, but they increase the actual memory footprint of each VM. So, to maximize oversubscription of memory, I'd want to limit myself to running VMs that are as similar as possible (say a farm of Ubuntu 10.04 LAMP serv

      • This is an interesting approach, especially across hosts in a cluster. Is it safe to assume you expect your hosts and interconnect to be very reliable?

        I'm curious about the methods you use to mitigate the problems that would seem to result if you clone VM 1 from Host A onto VM's 2-10 on hosts B-E, and Host A dies before the entirety of VM 1's memory is copied elsewhere. Can you shed any light on this?
      • by calmond (1284812)
        Correct me if I'm wrong, but the method you described sounds almost exactly like LVM Snapshots. A great approach, and saves a ton of disk space. How often should a VM be rebooted or re-cloned though? Memory is a lot more volatile than disk storage, so I would think that the longer the system runs, the more divergent the memory stacks would be, thus the less efficient this method would be over time, or am I missing something? Thanks!
      • Re: (Score:3, Interesting)

        by kscguru (551278)

        Disclaimer: VMware engineer; but I do like your blog post. It's one of the more accessible descriptions of cpu and memory overcommit that I have seen.

        The SnowFlock approach and VMware's approach end up making slightly different assumptions that really make each's techniques not applicable to the other. In a cluster it is advantageous to have one root VM because startup costs outweigh customization overhead; in a datacenter, each VM is different enough that the customization overhead outweighs the cost of

  • Even the same ratio of over-subscribed memory, around 300%, but without the overhead this article admits it has, which reduces its actual over-subscription ratio down to just over 200% instead:

    http://lwn.net/Articles/306704/

    Specifically, this link/LKML post: 52 1GB Windows VMs in 16GB of total physical RAM installed:

    http://lwn.net/Articles/306713/

    • Funny... in the VMware whitepaper [vmware.com] linked to from the article, even VMware wasn't able to get more than 110% memory over-consolidation from page sharing. I wonder what's so different about KVM's page sharing approach?

      • by amscanne (786530) on Tuesday August 10, 2010 @11:32PM (#33212174)
        I have one possibility. The blog post alluded to this. Page sharing can be done *much* more efficiently on Linux due to the fact that the ELF loader does not need to rewrite large chunks of the binaries when applications are loaded into memory. The Windows loader will rewrite addresses in whole sections of code if a DLL or EXE is not loaded at its "preferred base" virtual address. In Linux, these addresses are isolated through the use of trampolines. Basically, you can have ten instances of Windows all running the exact same Microsoft Word binaries and they might not share the code for the application. In Linux, if you have ten VMs running the same binaries of Open Office, there will be a lot more sharing.
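A toy Python model of why load-time rewriting hurts content-based page sharing. Everything here is hypothetical: the "binary" is a list of pages made of tagged tuples, `load_with_relocation` and `load_position_independent` are invented names, and hashing page contents stands in for whatever a real page-sharing implementation does. It is not a model of the actual Windows or ELF loaders.

```python
import hashlib

# Three "pages" of a made-up binary; ("addr", n) marks an absolute-address slot.
IMAGE = [
    [("op", "mov"), ("addr", 0x10)],
    [("op", "ret")],
    [("addr", 0x20), ("op", "jmp")],
]


def load_with_relocation(image, base):
    """Patch every address slot with base + offset, rewriting the page."""
    return [
        tuple(("addr", base + w[1]) if w[0] == "addr" else w for w in page)
        for page in image
    ]


def load_position_independent(image, base):
    """Leave the pages untouched regardless of where they are mapped."""
    return [tuple(page) for page in image]


def unique_pages(loaded_images):
    """Count distinct page contents, i.e. pages a content-based sharer must keep."""
    digests = set()
    for image in loaded_images:
        for page in image:
            digests.add(hashlib.sha256(repr(page).encode()).hexdigest())
    return len(digests)


bases = [0x400000, 0x500000, 0x600000]   # three guests, three different load addresses
rewritten = unique_pages([load_with_relocation(IMAGE, b) for b in bases])
untouched = unique_pages([load_position_independent(IMAGE, b) for b in bases])
print(rewritten, untouched)
```

In this model the three rewritten copies need 7 distinct pages, while the untouched copies collapse to 3. Only the page without address slots is shareable once the load addresses differ.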
        • by ringm000 (878375)
          Base addresses of DLLs in an application are typically chosen to avoid conflicts with system DLLs and between each other, so these conflicts are relatively rare. When they happen, the DLLs can be manually rebased.
        • Using trampolines for every cross-library call seems very inefficient...

          The windows method seems better for the more common case, where it does the costly rewriting at library load time, and then avoids an extra jump for every library function call.

          What's the performance impact of this? I bet it's at least a couple of percent, which is significant if it's across the entire system.

        • Re: (Score:3, Interesting)

          by milosoftware (654147)

          On x64 Windows systems, addressing is always relative, so this eliminates the DLL relocation. So it might actually save memory to use 64-bit guest OSes, as there will be less relocation and more sharing.

  • by Anonymous Coward

    This blog post is just a summary of 3 existing techniques: Paging, Ballooning, and Content-Based Sharing. It does not describe any new techniques, or give any new insights.

    It's a solid summary of these techniques, but nothing more.

    • by GooberToo (74388)

      A new implementation of an existing technique and/or technology can still be noteworthy. If this isn't the case, then an F-22 is really just a Wright brothers' Flyer - nothing new. My metaphor is absurd, but you get the point.

    • It doesn't even say which if any of those techniques it's using. It's a teaser, not news.

  • OpenVZ? (Score:1, Informative)

    by Anonymous Coward

    OpenVZ has had this for years now, which is one of the reasons it has gained popularity in the hosting world.

    • by KiloByte (825081)

      Or vserver. Or BSD jails.

      These just use the good old Unix memory management -- if you can coordinate between multiple VMs, things get a whole lot easier. The problem with VMs with separate kernels (Xen, VirtualBox, VMware, etc.) is that they have no way of knowing that a given page mmaps the same block on the disk.

      The technique described in the article is a hack that works only if all processes are started before you clone the VMs and nothing else happens later. Vserver does it strictly better -- if multiple V

  • nothing new (Score:1, Interesting)

    by Anonymous Coward

    nothing new .. i ran 6 w2k3 servers on a linux box running vmware server with 4GB of ram and allocated 1GB to each vm

  • Is this an ad? (Score:3, Insightful)

    by saleenS281 (859657) on Tuesday August 10, 2010 @11:25PM (#33212148) Homepage
    I noticed free memory on the system was at 2GB and dropping quickly when they moved focus away from the console session (even though all of the VMs had the exact same app set running). This appears to be absolutely nothing new or amazing... in fact, it reads like an ad for GridCentric.
  • Not exactly new (Score:1, Informative)

    by Anonymous Coward

    Oversubscription of memory for VMs has been around for decades - just not for the Intel platform. There are other older, more mature platforms for VM support...

  • by cdrguru (88047) on Tuesday August 10, 2010 @11:43PM (#33212214) Homepage

    One of the problems with folks in the computer software business today is that they are generally young and haven't had much experience with what has gone on before. Often, even when there is an opportunity to gather information about older systems, they don't think it is relevant.

    Well, here I would say it is extremely relevant to understand some of the performance tricks utilized by VM/370 and VM/SP in the 1970s and 1980s. VM/370 is pretty much the foundation of today's IBM virtualization offerings. In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.

    The key bit of information here is that for interactive users running CMS a significant optimization was put in place: sharing the bulk of the operating system pages. This was done by dividing the operating system into two parts, shared and non-shared, and by design avoiding writes to the shared portion. If a page was written to, a local copy was made and that page was no longer shared.

    This was extremely practical for VM/370 and later systems because all interactive users were using pretty much the same operating system - CMS. It was not unusual to have anywhere from 100 to 4000 interactive users on such systems, so sharing these pages meant huge gains in memory utilization.

    It seems to me that a reasonable implementation of this for virtualization today would be extremely powerful, in that the bulk of virtual machines are going to be running the same OS. Today most kernel pages are read-only, so sharing them across multiple virtual machines would make incredible sense. So instead of booting an OS "natively" you would load a shared system, where the shared (read-only) pages would be loaded along with an initial copy of writable non-shared memory from a snapshot taken at some point during initialization of the OS.

    This seems easy to do for Linux, even to the extent of having it assist with taking a snapshot during initialization. Doing this with Windows should be possible as well. This would greatly reduce the memory footprint of adding another virtual machine that uses the shared operating system. The memory then used by a new virtual machine would only be the non-shared pages. True, the bulk of the RAM of a virtual machine might be occupied by such non-shared pages, but the working set of a virtual machine is likely to be composed of a significant number of OS pages - perhaps 25% or more. Reducing memory requirements by 25% would be a significant performance gain and increase in available physical memory.
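    The shared/non-shared split described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: the `Guest` class, `physical_pages`, the 4 KB page size, and the 100-page "OS image" are all hypothetical and do not describe how VM/370 or any current hypervisor is implemented.

```python
PAGE_SIZE = 4096


class Guest:
    """Toy guest whose memory starts out as references to shared pages."""

    def __init__(self, shared_pages):
        self._shared = shared_pages   # read-only pages, shared by reference
        self._private = {}            # page index -> private writable copy

    def read(self, idx):
        return bytes(self._private.get(idx, self._shared[idx]))

    def write(self, idx, offset, data):
        if offset < 0 or offset + len(data) > PAGE_SIZE:
            raise ValueError("write crosses the page boundary")
        if idx not in self._private:
            # First write: copy the page; it is no longer shared for this guest.
            self._private[idx] = bytearray(self._shared[idx])
        self._private[idx][offset:offset + len(data)] = data

    def private_page_count(self):
        return len(self._private)


def physical_pages(shared_pages, guests):
    """Pages actually resident: one shared copy plus every private copy."""
    return len(shared_pages) + sum(g.private_page_count() for g in guests)


shared = [bytes(PAGE_SIZE) for _ in range(100)]   # hypothetical 100-page OS image
guests = [Guest(shared) for _ in range(3)]
guests[0].write(5, 0, b"local")

print(physical_pages(shared, guests))   # resident pages with sharing
print(len(shared) * len(guests))        # resident pages without sharing
```

    Only the guest that wrote to page 5 pays for a private copy; the other guests keep reading the single shared copy, which is where the memory savings come from.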

    • Yes, we pay attention...

      The concept is in Unix, including Linux, and probably in Windows - COW (copy-on-write) pages...
      fork() uses COW, vfork() shares the entire address space (but suspends the parent).
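      A small Python sketch of the behaviour that copy-on-write has to preserve after fork(): a write in the child must not be visible in the parent. It assumes a Unix-like system where `os.fork` is available, and the function name `fork_and_modify` is made up for this example. It demonstrates the semantics only; it does not measure how many pages the kernel actually copies.

```python
import os


def fork_and_modify():
    """Fork, let the child overwrite a buffer, and report both views of it."""
    data = bytearray(b"parent")
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write touches only the child's copy of the memory.
        try:
            os.close(read_fd)
            data[:] = b"child!"
            os.write(write_fd, bytes(data))
        finally:
            os._exit(0)
    # Parent: collect what the child saw, then reap it.
    os.close(write_fd)
    child_view = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(data), child_view


parent_view, child_view = fork_and_modify()
print(parent_view, child_view)
```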

      $ man vfork

      [snip]

      Historic Description
      Under Linux, fork(2) is implemented using copy-on-write pages, so the
      only penalty incurred by fork(2) is the time and memory required to

      • by sjames (1099)

        Parent is talking about the kernel itself sharing pages across instances, not userspace processes running under a single instance.

    • by Anonymous Coward on Tuesday August 10, 2010 @11:52PM (#33212246)

      Very informative, but pure page sharing doesn't work for most Windows variants, due to the fact that Windows binaries aren't position independent. That means, each time MS Office is loaded on a different machine, the function jump points are re-written according to where in the address space the code gets loaded, which is apparently usually different on different Windows instances. That means very little opportunity for page sharing.

      These guys seem to be doing something different...

      • by ChipMonk (711367)

        Very informative, but pure page sharing doesn't work for most Windows variants, due to the fact that Windows binaries aren't position independent.

        Is that also true for 64-bit Windows binaries? According to the docs I've read, position-independent binary code is preferred in 64-bits.

    • by pz (113803) on Wednesday August 11, 2010 @12:16AM (#33212334) Journal

      In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.

      From what I can gather, it was not Cambridge University (of which I believe there is still only one, located in the UK, despite the similarly named Cambridge College [cambridgecollege.edu] in Cambridge, Massachusetts, but as the latter is an adult-educational center founded in 1971, the chances are that wasn't where CP/67 was developed), but rather IBM's Cambridge Scientific Center [wikipedia.org] that used to be in the same building as MIT's Project MAC. Project MAC (later the MIT Lab for Computer Science) is where much of the structure of modern OSes was invented.

      Those were heady days for Tech Square. And, otherwise, the parent poster is right on.

  • This may seem obvious, but in reading some of the trade press and the general buzz, it seems that it isn't obvious to everyone:

    Oversubscription only works when the individual VMs aren't doing much. If you have a pile of VMs oversubscribed to the degree TFA is talking about, it means the VM overhead is exceeding the useful computation. There are cases where that can't be helped, such as each VM is a different customer, but in an enterprise environment, it suggests that you should be running more than one ser

    • by drsmithy (35869)

      Oversubscription only works when the individual VMs aren't doing much. If you have a pile of VMs oversubscribed to the degree TFA is talking about, it means the VM overhead is exceeding the useful computation. There are cases where that can't be helped, such as each VM is a different customer, but in an enterprise environment, it suggests that you should be running more than one service per instance and have less instances.

      No, you ideally want as few services per instance as possible, to reduce dependenci

      • by sjames (1099)

        No, you ideally want as few services per instance as possible, to reduce dependencies and simplify architectures.

        If the services can be separated onto 2 VMs, they are necessarily orthogonal. If the availability requirements differ, they should certainly NOT be running as VMs on the same machine.

        As for the case of different customers, that would fall under the exception where it can't be helped.

        There are numerous examples where multiple clustered VMs will perform better than a single OS image, on the same hardware.

        Name one!

      • by jon3k (691256)

        There are numerous examples where multiple clustered VMs will perform better than a single OS image, on the same hardware.

        While I agree that there are numerous advantages to taking a single host performing many functions and splitting it into many VMs performing individual functions (UNIX philosophy after all - do one thing and do it well!), I have yet to find an instance that bears out the statement you made above. What workload would perform better in that instance? If I ran one webserver on bare metal or 4 VMs each running an instance of that same webserver on the same hardware, it will perform better on bare metal (du

    • by jon3k (691256)

      I swear, some in the trade rags seem to honestly think there is a benefit to splitting a server into 16 VMs and then combining those into a virtual beowulf cluster for production work (it makes perfect sense for development and testing, of course).

      There are a number of benefits to what you just described.

      • Application compatibility issues - much easier to troubleshoot a problem when each guest is performing a single function
      • Performance - you can move VMs around without rebooting using vMotion if one VM begins to consume too many resources
      • Upgrades/Maintenance - you can upgrade software or even a single host OS and reboot a guest without affecting the other guests.

      In practice these are exceptionally useful features and worth the small performance hit

      • by sjames (1099)

        Absolutely none of those things apply to a beowulf cluster.

        Understand, I'm not saying VMs have no place at all, they certainly do. It's just that they are not the end all and be all of computing they are cracked up to be in the trade rags. There is always a performance penalty for using a VM overall. It may be worth it but that doesn't mean it isn't there.

        • by jon3k (691256)
          Oh, you literally meant a beowulf cluster. That's a horrible idea; who does that? I'm against taking a workload on bare metal, creating multiple VMs on that same hardware and running identical workloads across all those guests, at least for performance reasons. I've seen people do things like MS Exchange clusters in a box (1 server, 2 VMs each running Exchange Enterprise) and there's at least an argument to be made there, but of course it isn't one for performance.
  • When can we just effectively get what we pay for? This would explain the sudden jump in Intel-based Camfrog servers with a higher offering of hardware.

    This effectively means people can now lie about the hardware they're leasing out to you in a data center. They say you're getting 4GB, you're actually getting 1.5GB of RAM.

    Our internet is oversubscribed, our processors are getting there, and now RAM?

    When are the designers of this stuff going to just build the fucking hardware instead of trying to lie about it

    • When are the designers of this stuff going to just build the fucking hardware instead of trying to lie about it?

      When people are willing to pay for it. If you shop on price, this is the natural result, the need to squeeze as much as you can out of a capital asset.

      • by sjames (1099)

        Sad but true, especially when there's always someone out there ready to promise more for less and customers ready to believe the lie.

    • by gregrah (1605707)
      Really? That was your conclusion upon reading this article??

      Virtual memory has been around for quite a while now, and I don't think its inventors came up with the idea with the intention to scam anyone. I'd say your outrage at "the designers of this stuff" may be misplaced.
      • by Khyber (864651)

        Not when this is apparently the exact same technology being used to run multiple heavy-traffic video chat servers on the same physical silicon. No wonder people on Camfrog are complaining about their servers lagging so hard, if this is the kind of thing we're paying for when we're actually expecting physical hardware.

    • by Slashcrap (869349)

      When can we just effectively get what we pay for? This would explain the sudden jump in Intel-based Camfrog servers with a higher offering of hardware.

      This effectively means people can now lie about the hardware they're leasing out to you in a data center. They say you're getting 4GB, you're actually getting 1.5GB of RAM.

      Our internet is oversubscribed, our processors are getting there, and now RAM?

      When are the designers of this stuff going to just build the fucking hardware instead of trying to lie about it?

      Sorry about your anger issues and obvious lack of understanding about what this is.

      • by Khyber (864651)

        No, I know EXACTLY what the issue is, having called my hosting provider for my video chat server. They just upgraded to this sort of management system, and my video server had been lagging horribly almost since the moment of implementation. And this would explain it - I've been moved to a shared server with overprovisioned hardware.

        Sorry you're not experienced enough with realtime applications to know when something's fucking with your system.

        • Re: (Score:3, Insightful)

          by TheRaven64 (641858)

          Nope, you really don't seem to understand what this is at all. It is eliminating duplicated pages in the system, so if two VMs have memory pages with the same contents the system only keeps one copy. To a VM, this makes absolutely no difference - the pages are copy-on-write, and when neither VM modifies them they both can see the same one without any interference (as is common with mapped process images, kernel stuff, and so on). The only thing that will change is that there will be reduced cache conten

          • by Khyber (864651)

            You still are in the wrong direction, Arctic cold.

            Let's take a machine with 4 VMs. only 2GB RAM. Running camfrog video servers.

            Only about 20MB will be used for the same program - everything else is a constant video stream and thus can't be swapped out to a disk cache.

            With Camfrog, anything higher than 300 video streams will max out that 2GB RAM.

            4 VMs all streaming that many video streams with only 2GB of physical memory WILL NOT WORK.

            Run a bunch of realtime programs as intensive as a real time thousand+ vid

  • I had recently started poking around the lguest hypervisor. From my limited reading I believe 2 of the 3 memory oversubscription choices mentioned in the article are present in Linux. Existing Linux-based open source hypervisors like KVM use the paging/swap mechanism (i.e., for x86, the paravirt mechanism). Ballooning is possible using virtio_balloon. Kernel shared memory in Linux allows dynamic sharing of memory pages between processes - this probably doesn't apply to virtualization.

    I couldn't find any C

    • by Slashcrap (869349)

      I couldn't find any CPU over-subscription thing in open-source hypervisors. It seems to be the only area where open-source hypervisors are lacking.

      Didn't look too hard, did you?

    • by DrPizza (558687)

      If your load average is >1, you have CPU over-subscription....

      • by idiot900 (166952) *

        Actually, if the load average is greater than the number of cores, you have oversubscription.

      • by jon3k (691256)
        Uh, no. Load average can also refer to outstanding I/O (think disk, network, etc). CPU utilization could be 2% but you could have a load average of 500.
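        The rule of thumb from this sub-thread can be written down as a few lines of Python. This is a rough sketch, and `cpu_oversubscribed` is a name invented here. As the last reply points out, on Linux the load average also counts tasks blocked on I/O, so a value above the core count is only a hint of CPU oversubscription, not proof.

```python
import os


def cpu_oversubscribed(load_average, cores):
    """Return True when the load average exceeds the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average > cores


try:
    one_minute = os.getloadavg()[0]   # not available on every platform
except (AttributeError, OSError):
    one_minute = None

if one_minute is not None:
    cores = os.cpu_count() or 1
    print(f"1-min load {one_minute:.2f} on {cores} cores:",
          cpu_oversubscribed(one_minute, cores))
```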
  • VMware has allowed RAM oversubscription for years. Indeed, it's one of the killer features of that platform over the alternatives. Who out there using VMware in non-trivial environments _isn't_ oversubscribing RAM?
    • by swb (14022)

      We usually advise against it if possible, but some of that is consulting CYA; when clients are new to virtualization they are often very sensitive to perceived performance differences between physical and virtual systems. A new virtual environment where someone decided they wanted 8 Windows machines with 8 GB RAM running in 32 GB physical RAM usually gets too far oversubscribed, swaps hard (on a SAN) and the customer complains mightily.

      Usually we find that a little tuning of VMs makes sense, since you don'

      • by drsmithy (35869)

        We usually advise against it if possible, but some of that is consulting CYA; when clients are new to virtualization they are often very sensitive to perceived performance differences between physical and virtual systems. A new virtual environment where someone decided they wanted 8 Windows machines with 8 GB RAM running in 32 GB physical RAM usually gets too far oversubscribed, swaps hard (on a SAN) and the customer complains mightily.

        Well, that (swapping) should only happen if the VMs really do need and

  • Really, this amount of overcommit is nothing. It's been done for decades.

    I manage a little over 200 virtual servers, spread across 7 z/VM hypervisors, and 2 mainframes. They are currently running with overcommit ratios of 4.59:1, 3.87:1, 3.56:1, 2.05:1, 1.19:1, 1.19:1, and .9:1. And this is a relatively small shop and somewhat low overcommits for the environment.
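    For readers unfamiliar with the notation, an overcommit ratio is simply the memory promised to guests divided by the physical memory behind it. A short Python sketch, with `overcommit_ratio` as a made-up helper name, using the 16 one-gigabyte VMs on 5 GB from the story summary as input:

```python
def overcommit_ratio(guest_memory_gb, physical_memory_gb):
    """Total memory promised to guests divided by physical memory."""
    if physical_memory_gb <= 0:
        raise ValueError("physical memory must be positive")
    return sum(guest_memory_gb) / physical_memory_gb


# 16 one-gigabyte desktop VMs on a host with 5 GB of RAM (from the summary).
demo = overcommit_ratio([1] * 16, 5)
print(f"{demo:.2f}:1")
```

    A ratio below 1, like the last hypervisor in the list above, means the guests together are promised less memory than the host physically has.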

    That's one of the benefits of virtualization...and yes, I know that if all guests decided to allocate all of their memory at once, we'd drive th
