
Extreme Memory Oversubscription For VMs

Laxitive writes "Virtualization systems currently have a pretty easy time oversubscribing CPUs (running lots of VMs on a few CPUs), but have had a very hard time oversubscribing memory. GridCentric, a virtualization startup, just posted on their blog a video demoing the creation of 16 one-gigabyte desktop VMs (running X) on a computer with just 5 gigs of RAM. The blog post includes a good explanation of how this is accomplished, along with a description of how it's different from the major approaches being used today (memory ballooning, VMware's page sharing, etc.). Their method is based on a combination of lightweight VM cloning (sort of like fork() for VMs) and on-demand paging. Seems like the 'other half' of resource oversubscription for VMs might finally be here."
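The fork() analogy in the summary can be made concrete at the process level. Below is a minimal sketch (not GridCentric's actual mechanism) of the copy-on-write behavior that makes fork()-style cloning cheap: the child starts out sharing the parent's memory, and pages are only copied when one side writes to them.

```python
import os

# Copy-on-write demo: after fork(), parent and child share pages until
# one of them writes. A VM clone applies the same idea at the
# whole-machine level, pulling memory from the parent image on demand.
data = bytearray(b"parent")

pid = os.fork()
if pid == 0:
    # Child: this write forces a private copy of the touched page;
    # the parent's view of `data` is unaffected.
    data[:] = b"child!"
    os._exit(0)

os.waitpid(pid, 0)
assert bytes(data) == b"parent"  # parent still sees its original bytes
```

The point of the analogy: sixteen clones of a booted 1 GB VM need nowhere near 16 GB, because untouched pages are shared with (or fetched on demand from) the original image.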
This discussion has been archived. No new comments can be posted.
  • nothing new (Score:1, Interesting)

    by Anonymous Coward on Wednesday August 11, 2010 @12:21AM (#33212130)

    Nothing new .. I ran 6 W2K3 servers on a Linux box running VMware Server with 4 GB of RAM and allocated 1 GB to each VM.

  • Re:Leaky Fawcet (Score:5, Interesting)

    by Mr Z ( 6791 ) on Wednesday August 11, 2010 @12:48AM (#33212232) Homepage Journal

    Sometimes that doesn't work out so well. If the heap is fragmented, with live allocations wedged into the gaps between leaked items, the leaks effectively amplify your working set size: every page holding a live object stays hot even though most of it is leaked garbage, which can lead to a lot of strange thrashing.

    I think that may be one of the things that was happening to older Firefoxes (2.x when viewing Gmail, in particular)... not only did it leak memory, it leaked memory in such a way that the leaked pages couldn't just stay in swap.

  • Re:Leaky Fawcet (Score:5, Interesting)

    by sjames ( 1099 ) on Wednesday August 11, 2010 @02:33AM (#33212610) Homepage Journal

    I often see uptimes measured in years. It's not at all unusual for a server to need no driver updates for its useful lifetime if you spec the hardware based on stable drivers being available. The software needs updates in that time, but not the drivers.

    In other cases, some of the drivers may need an update, but provided they're modules and not for something you can't take offline (such as the disk holding the root filesystem), updating them is no problem.

    Note that I generally spec RAM so that zero swap is actually required if nothing leaks and no exceptional condition arises.

    When disks come in 2TB sizes and server boards have 6 SAS ports on them, why should I sweat 8 GB?

    Let's face it, if the swap space thrashes (yes, I know paging and swapping are distinct but it's still called swap space for hysterical raisins) it won't much matter if it is 1:1 or .5:1, performance will tank. However, if it's just leaked pages, it can be useful.

    For other situations, it makes even more sense. For example, in HPC, if you have a long-running job and then a short but high-priority job comes up, you can SIGSTOP the long job and let it page out. Then when the short run is over, SIGCONT it again. Yes, you could add a swap file at that point, but it's nice if it's already there, especially if a scheduler might make the decision to stop a process on demand. Of course, on other clusters (depending on requirements) I've configured with no swap at all.
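    The stop/resume trick described above is just standard job control. A minimal sketch (with "sleep 60" standing in for the long-running job, and the urgent job elided):

```python
import signal
import subprocess

# Pause a long-running job so that, under memory pressure, its pages
# become candidates for swap-out; run the urgent job; then resume it.
long_job = subprocess.Popen(["sleep", "60"])

long_job.send_signal(signal.SIGSTOP)  # long job stops consuming CPU
# ... launch and wait for the short, high-priority job here ...
long_job.send_signal(signal.SIGCONT)  # long job picks up where it left off

long_job.terminate()  # cleanup for this demo only
long_job.wait()
```

    A batch scheduler doing this on demand is exactly the case where pre-provisioned swap pays off: the stopped job's memory can leave RAM without anyone having to create a swap file under pressure.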

    And since Linux can do crash dumps and can freeze into swap, it makes sense on laptops and desktops as well.

    Finally, it's useful for cases where you have RAID for availability, but don't need SO much availability that a reboot for a disk failure is a problem. In that case, best performance suggests 2 equal-sized swaps on 2 drives. If one fails, you might need a reboot, but won't have to wait on a restore from backup and you'll still have enough swap.

    Pick your poison, either way there exists a failure case.

    And yes, in the old days I went with 2:1, but don't do that anymore because it really is excessive these days.

  • by descubes ( 35093 ) on Wednesday August 11, 2010 @02:46AM (#33212666) Homepage

    Having written VM software myself (HP Integrity VM), I find this fascinating. Congratulations for a very interesting approach.

    That being said, I'm sort of curious how well that would work with any amount of I/O happening. If you have some DMA transfer in progress to one of the pages, you can't just snapshot the memory until the DMA completes, can you? Consider a disk transfer from a SAN. With high traffic, you may be talking about seconds, not milliseconds, no?

  • Re:Leaky Fawcet (Score:3, Interesting)

    by akanouras ( 1431981 ) on Wednesday August 11, 2010 @04:05AM (#33212894)

    Excuse my nitpicking, your post sparked some new questions for me:

    The problem stems when legitimate applications attempt to use that memory. How long does it take to page (read/write) 16GB, 4KB at a time?

    Are you sure that it's only reading/writing 4KB at a time? It seems pretty braindead to me.

    The old 1:1+x and 2:1 memory to disk ratios are based on notions of swapping rather than paging (yes, those are two different virtual memory techniques), plus allowing room for kernel dumps, etc. Paging is far more efficient than swapping ever was.

    Could you elaborate on the difference between swapping and paging? I have always thought of it (adopting the term "paging") as an effort to disconnect modern Virtual Memory implementations from the awful VM performance of Windows 3.1/9x. Wikipedia [wikipedia.org] mentions them as interchangeable terms and other sources on the web seem to agree.

    You might come back and say, one day I might need it. Well, one day you can create a file (dd if=/dev/zero of=/pagefile bs=1024 count=xxxx), initialize it as swap (mkswap /pagefile), and add it as a low-priority paging device (swapon -p0 /pagefile). Problem solved.

    Just mentioning here that Swapspace [pqxx.org] (Debian [debian.org] package) takes care of that, with configurable thresholds.

    You may say the performance will be horrible with paging on top of a file system - but if you're overflowing several GB to a page file on top of a file system, the performance impact won't be noticeable as you already have far, far greater performance problems. And if the page activity isn't noticeable, the fact it's on a file system won't matter.

    Quoting Andrew Morton [lkml.org]:
    "[On 2.6 kernels the difference is] None at all. The kernel generates a map of swap offset -> disk blocks at swapon time and from then on uses that map to perform swap I/O directly against the underlying disk queue, bypassing all caching, metadata and filesystem code."

  • by milosoftware ( 654147 ) on Wednesday August 11, 2010 @05:51AM (#33213218) Homepage

    On x64 Windows systems, code uses RIP-relative addressing, which largely eliminates DLL relocation. So it might actually save memory to use 64-bit guest OSes, as there will be less relocation and more page sharing.

  • by kscguru ( 551278 ) on Wednesday August 11, 2010 @01:15PM (#33217432)

    Disclaimer: VMware engineer; but I do like your blog post. It's one of the more accessible descriptions of CPU and memory overcommit that I have seen.

    The SnowFlock approach and VMware's approach end up making slightly different assumptions that make each one's techniques inapplicable to the other. In a cluster it is advantageous to have one root VM because startup costs outweigh customization overhead; in a datacenter, each VM is different enough that the customization overhead outweighs the cost of starting a whole new VM. Particularly with Windows: a Windows VM essentially needs to be rebooted to be customized (and thus the memory server stops being useful), whereas Linux can more easily customize on-the-fly. Different niches of the market.

    The second big difference is architectural. VMware handles more in the virtual machine monitor; KVM and Xen use simpler virtual machine monitors that offload the complex tasks to a parent partition. This means that for VMware, each additional VM instance takes ~100MB of hypervisor overhead - small relative to non-idle VMs, but large relative to idle VMs. It's purely an engineering tradeoff: a design like VMware's vmm will always be (a little bit) quicker per-VM; a design like KVM/Xen's vmm will always scale (a little bit) better with idle VMs.

    These combine to make it easy to show KVM/Xen hypervisors more deeply overcommitted than VMware hypervisors by using only idle Linux VMs. VMware doesn't care about such numbers, because the difference disappears or favors VMware as load increases. If GridCentric has found a business for deeply overcommitted VMs, more power to you!
