Operating Systems / Software / Technology

Extreme Memory Oversubscription For VMs 129

Laxitive writes "Virtualization systems currently have a pretty easy time oversubscribing CPUs (running lots of VMs on a few CPUs), but have had a very hard time oversubscribing memory. GridCentric, a virtualization startup, just posted on their blog a video demoing the creation of 16 one-gigabyte desktop VMs (running X) on a computer with just 5 gigs of RAM. The blog post includes a good explanation of how this is accomplished, along with a description of how it differs from the major approaches in use today (memory ballooning, VMware's page sharing, etc.). Their method is based on a combination of lightweight VM cloning (sort of like fork() for VMs) and on-demand paging. Seems like the 'other half' of resource oversubscription for VMs might finally be here."
  • Kernel shared memory (Score:5, Informative)

    by Narkov ( 576249 ) on Wednesday August 11, 2010 @12:08AM (#33212078) Homepage

    The Linux kernel uses something called kernel shared memory (KSM) to achieve this with its virtualization technology. LWN has a great article on it:

    http://lwn.net/Articles/306704/ [lwn.net]
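
    For context: KSM (merged in Linux 2.6.32) only scans regions that the application - in the KVM case, qemu-kvm itself - has explicitly flagged with madvise(MADV_MERGEABLE). A minimal C sketch of that flagging, assuming a kernel built with CONFIG_KSM and ksmd switched on via /sys/kernel/mm/ksm/run (illustrative only, not GridCentric's or KVM's actual code):

        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            /* Anonymous region standing in for guest RAM. */
            size_t len = 64UL * 1024 * 1024;
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) { perror("mmap"); return 1; }

            memset(buf, 0x5a, len);  /* identical page contents -> merge candidates */

            /* Ask ksmd to consider this range for same-content page sharing. */
            if (madvise(buf, len, MADV_MERGEABLE) != 0) {
                perror("madvise(MADV_MERGEABLE)");
                return 1;
            }

            /* ksmd merges duplicates in the background; progress shows up in
             * /sys/kernel/mm/ksm/pages_shared and pages_sharing. */
            pause();
            return 0;
        }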

  • by Anonymous Coward on Wednesday August 11, 2010 @12:14AM (#33212100)

    This blog post is just a summary of 3 existing techniques: Paging, Ballooning, and Content-Based Sharing. It does not describe any new techniques, or give any new insights.

    It's a solid summary of these techniques, but nothing more.

  • OpenVZ? (Score:1, Informative)

    by Anonymous Coward on Wednesday August 11, 2010 @12:20AM (#33212122)

    OpenVZ has had this for years now, which is one of the reasons it has gained popularity in the hosting world.

  • by amscanne ( 786530 ) on Wednesday August 11, 2010 @12:32AM (#33212174)
    I have one possibility; the blog post alluded to this. Page sharing can be done *much* more efficiently on Linux because the ELF loader does not need to rewrite large chunks of a binary when applications are loaded into memory. The Windows loader will rewrite addresses in whole sections of code if a DLL or EXE is not loaded at its "preferred base" virtual address; on Linux, those addresses are isolated through trampolines. Basically, you can have ten instances of Windows all running the exact same Microsoft Word binaries and they might not share the application's code, while ten Linux VMs running the same OpenOffice binaries will share a lot more.
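
    A quick way to see the distinction being drawn here is the e_type field of the ELF header: shared libraries and PIE executables are ET_DYN (position independent, so their text maps unchanged at any address and stays sharable), while a fixed-address executable is ET_EXEC. A minimal sketch that reports the type, assuming 64-bit ELF files on Linux (illustrative, not from the original post):

        #include <elf.h>
        #include <stdio.h>
        #include <string.h>

        int main(int argc, char **argv)
        {
            if (argc != 2) { fprintf(stderr, "usage: %s <elf-file>\n", argv[0]); return 1; }

            FILE *f = fopen(argv[1], "rb");
            if (!f) { perror("fopen"); return 1; }

            Elf64_Ehdr hdr;                       /* assumes a 64-bit ELF object */
            if (fread(&hdr, sizeof hdr, 1, f) != 1 ||
                memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0) {
                fprintf(stderr, "not a readable ELF file\n");
                fclose(f);
                return 1;
            }
            fclose(f);

            if (hdr.e_type == ET_DYN)
                puts("ET_DYN: position independent; text pages map (and share) unchanged");
            else if (hdr.e_type == ET_EXEC)
                puts("ET_EXEC: linked to a fixed load address");
            else
                printf("e_type = %u (not a regular executable or shared object)\n", hdr.e_type);
            return 0;
        }

    Running it over /lib and /usr/lib shows the libraries as ET_DYN, which is exactly the property that keeps identical pages identical across VMs.
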
  • Re:Leaky Fawcet (Score:5, Informative)

    by ls671 ( 1122017 ) * on Wednesday August 11, 2010 @12:35AM (#33212184) Homepage

    Memory leaks usually get swapped out... your swap usage will grow, but the system will keep going just as fast, since those pages will never get swapped in again. I have tried several times to explain that to slashdotters who brag about not using any swap space nowadays and who called me stupid for reserving a 2 gig (or larger) swap partition on a 4 gig RAM machine that sometimes runs for two years between reboots.

    Oh well....

  • Not exactly new (Score:1, Informative)

    by Anonymous Coward on Wednesday August 11, 2010 @12:38AM (#33212198)

    Oversubscription of memory for VMs has been around for decades - just not for the Intel platform. There are other older, more mature platforms for VM support...

  • by cdrguru ( 88047 ) on Wednesday August 11, 2010 @12:43AM (#33212214) Homepage

    One of the problems with folks in the computer software business today is that they are generally young and haven't had much experience with what has gone on before. Often, even when there is an opportunity to gather information about older systems, they don't think it is relevant.

    Well, here I would say it is extremely relevant to understand some of the performance tricks utilized by VM/370 and VM/SP in the 1970s and 1980s. VM/370 is pretty much the foundation of today's IBM virtualization offerings. In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.

    The key bit of information here is that for interactive users running CMS, a significant optimization was put in place: sharing the bulk of the operating system pages. This was done by dividing the operating system into two parts, shared and non-shared, and by design avoiding writes to the shared portion. If a page was written to, a local copy was made and that page was no longer shared.

    This was extremely practical for VM/370 and later systems because all interactive users were running pretty much the same operating system - CMS. It was not unusual to have anywhere from 100 to 4000 interactive users on such systems, so sharing these pages meant huge gains in memory utilization.

    It seems to me that a reasonable implementation of this for virtualization today would be extremely powerful, in that the bulk of virtualized machines are going to be running the same OS. Today most kernel pages are read-only, so sharing them across multiple virtual machines would make incredible sense. So instead of booting an OS "natively" you would instead load a shared system, where the shared (read-only) pages would be loaded along with an initial copy of writable non-shared memory from a snapshot taken at some point during initialization of the OS.

    This would seem to be easily doable for Linux, even to the extent of having it assist with taking a snapshot during initialization. Doing this with Windows should also be possible. It would greatly reduce the memory footprint of adding another virtual machine that uses a shared operating system: the memory used by a new virtual machine would only be the non-shared pages. True, the bulk of a virtual machine's RAM might be occupied by such non-shared pages, but the working set of a virtual machine is likely to be composed of a significant number of OS pages - perhaps 25% or more. Reducing memory requirements by 25% would be a significant performance gain and increase in available physical memory.
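
    The sharing-until-written scheme described above is the same copy-on-write idea modern kernels apply to fork(): parent and child share every physical page until one of them writes, and only the touched page gets copied. A minimal process-level sketch of the principle (an analogy only, not VM/370 or hypervisor code):

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void)
        {
            size_t len = 16UL * 1024 * 1024;      /* stand-in for a "shared OS image" */
            unsigned char *image = malloc(len);
            if (!image) { perror("malloc"); return 1; }
            memset(image, 0xAA, len);             /* touch every page so they exist */

            pid_t pid = fork();                   /* child shares all pages copy-on-write */
            if (pid == 0) {
                image[0] = 0x55;                  /* write: kernel copies just this one page */
                printf("child sees  %#x\n", image[0]);
                _exit(0);
            }
            waitpid(pid, NULL, 0);
            printf("parent sees %#x\n", image[0]);   /* still 0xAA: the shared copy is intact */
            free(image);
            return 0;
        }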

  • by Anonymous Coward on Wednesday August 11, 2010 @12:52AM (#33212246)

    Very informative, but pure page sharing doesn't work for most Windows variants, because Windows binaries aren't position independent. That means each time MS Office is loaded on a different machine, the function jump points are rewritten according to where in the address space the code gets loaded, which is apparently usually different on different Windows instances. That means very little opportunity for page sharing.

    These guys seem to be doing something different...

  • Re:Leaky Fawcet (Score:2, Informative)

    by sjames ( 1099 ) on Wednesday August 11, 2010 @12:54AM (#33212254) Homepage Journal

    Personally, I like to make swap equal to the size of RAM for exactly that reason. It's not like a few Gig on a HD is a lot anymore.

  • by pz ( 113803 ) on Wednesday August 11, 2010 @01:16AM (#33212334) Journal

    In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.

    From what I can gather, it was not Cambridge University (of which I believe there is still only one, located in the UK, despite the similarly named Cambridge College [cambridgecollege.edu] in Cambridge, Massachusetts; but as the latter is an adult-education center founded in 1971, chances are that wasn't where CP/67 was developed), but rather IBM's Cambridge Scientific Center [wikipedia.org], which used to be in the same building as MIT's Project MAC. Project MAC (which later became the MIT Laboratory for Computer Science) is where much of the structure of modern OSes was invented.

    Those were heady days for Tech Square. And, otherwise, the parent poster is right on.

  • by amscanne ( 786530 ) on Wednesday August 11, 2010 @01:41AM (#33212434)

    Disclaimer: I wrote the blog post. I noticed the massive slashdot traffic, so I popped over. The article summary is not /entirely/ accurate, and doesn't really completely capture what we're trying to do with our software.

    Our mechanism for performing over-subscription is actually rather unique.

    Copper (the virtualization platform in the demo) is based on an open-source Xen-based virtualization technology named SnowFlock. Where KSM does post-processing on memory pages to share memory on a post-hoc basis, the SnowFlock method is much more similar to unix 'fork()' at the VM level.

    We actually clone a single VM into multiple ones by pausing the original VM, COW-ing its memory, and then spinning up multiple independent, divergent clones off of that memory snapshot.

    We combine this with a mechanism for bringing up lightweight VMs fetching remote memory on-demand, which allows us to bring up clones across a network about as quickly and easily as clones on the same machine. We can 'clone' a VM into 10 VMs spread across different hosts in a matter of seconds.

    So the mechanism for accomplishing this works as follows:
    1. The master VM is momentarily paused (a few milliseconds) and its memory is snapshotted.
    2. A memory server is set up to serve that snapshot.
    3. 16 'lightweight' clone VMs are brought up with most of their memory "empty".
    4. The clones start pulling memory from the server on-demand.

    All of this takes a few seconds from start to finish, whether on the same machine or across the network.

    We're using all of this to build a bona-fide cluster operating system where you host virtual clusters which can dynamically grow and shrink on demand (in seconds, not minutes).

    The blog post was intended not as an ad, but rather a simple demo of what we're working on (memory over-subscription that leverages our unique cloning mechanism) and a general introduction to standard techniques for memory over-commit. The pointer to KSM is appreciated, I missed it in the post :)
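
    To make the "lightweight clones fetching memory on demand" part concrete: the effect is roughly that of lazily mapping a big memory snapshot and letting page faults pull in only what the clone actually touches. A minimal sketch of that demand-paging behaviour, using a hypothetical local snapshot file in place of SnowFlock/Copper's actual memory-server protocol:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
            const char *path = (argc > 1) ? argv[1] : "memory.snapshot";  /* hypothetical file */
            int fd = open(path, O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            struct stat st;
            if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

            /* MAP_PRIVATE gives copy-on-write semantics: reads fault in pages from
             * the snapshot on demand, writes diverge locally - like a clone VM. */
            char *mem = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE, fd, 0);
            if (mem == MAP_FAILED) { perror("mmap"); return 1; }

            /* Nothing has been read yet; touching one byte faults in exactly one page. */
            volatile char c = mem[0];
            (void)c;

            printf("mapped %lld bytes lazily, touched a single page\n",
                   (long long)st.st_size);
            munmap(mem, st.st_size);
            close(fd);
            return 0;
        }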

  • Re:Leaky Fawcet (Score:5, Informative)

    by GooberToo ( 74388 ) on Wednesday August 11, 2010 @01:41AM (#33212436)

    Unfortunately you're not alone in doing this. It's a deprecated practice that used to make sense, but hasn't in a very long time.

    The problem arises when legitimate applications attempt to use that memory. How long does it take to page (read/write) 16GB, 4KB at a time? If a legitimate application that uses large amounts of memory runs away with a bug, it can effectively bring your entire system to a halt, because it will take a long, long time before it runs out of memory.

    Excluding Windows boxes (they have their own unique paging, memory/file mapping, and backing store systems), swap beyond 1/4-1/2 of memory is generally a waste these days. As someone else pointed out, sure, you can buy more uptime from leaking applications, but frankly that's hardly realistic. Expecting not to need a kernel update over the span of a couple of years is just silly, unless you care more for uptime than you do for security and/or features and/or performance.

    The old 1:1+x and 2:1 memory-to-disk ratios are based on notions of swapping rather than paging (yes, those are two different virtual memory techniques), plus allowing room for kernel dumps, etc. Paging is far more efficient than swapping ever was. These days, if you ever come close to needing a 1:1, let alone 2:1, page file/partition, you're not even close to properly spec'ing your required memory. In other words, with few exceptions, if you have a page file/partition anywhere near that size, you didn't understand how the machine was to be used in the first place.

    You might come back and say, one day I might need it. Well, one day you can create a file (dd if=/dev/zero of=/pagefile bs=1024 count=xxxx), initialize it as swap (mkswap /pagefile), and add it as a low-priority paging device (swapon -p0 /pagefile). Problem solved. You may say the performance will be horrible with paging on top of a file system - but if you're overflowing several GB to a page file on top of a file system, the impact won't be noticeable, because you already have far, far greater performance problems. And if the page activity isn't noticeable, the fact that it's on a file system won't matter.

    Three decades ago it made sense. These days, it's just silly and begging for your system to one day grind to a halt.
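
    For what it's worth, the low-priority knob in that swapon -p0 example maps directly onto the swapon(2) syscall's SWAP_FLAG_PREFER bits. A minimal sketch, assuming /pagefile has already been created with dd and initialized with mkswap as above, and that this runs as root (illustrative only):

        #include <stdio.h>
        #include <sys/swap.h>

        int main(void)
        {
            /* Encode priority 0, the same thing `swapon -p0` asks for. */
            int prio  = 0;
            int flags = SWAP_FLAG_PREFER |
                        ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);

            /* /pagefile must already exist and carry a swap signature:
             *   dd if=/dev/zero of=/pagefile bs=1024 count=xxxx
             *   mkswap /pagefile
             */
            if (swapon("/pagefile", flags) != 0) {
                perror("swapon");
                return 1;
            }
            puts("/pagefile enabled as low-priority swap");
            return 0;
        }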

  • Re:Leaky Fawcet (Score:3, Informative)

    by lars_stefan_axelsson ( 236283 ) on Wednesday August 11, 2010 @06:19AM (#33213312) Homepage

    Could you elaborate on the difference between swapping and paging? I have always thought of it (adopting the term "paging") as an effort to disconnect modern Virtual Memory implementations from the awful VM performance of Windows 3.1/9x. Wikipedia mentions them as interchangeable terms and other sources on the web seem to agree.

    It's actually (buried) in the wikipedia article you link, but only a sentence or so. In the old days, before paging, a Unix system would swap an entire running program onto disk/drum. (That's where the sticky bit comes from: swap space was typically much faster than other secondary storage, if nothing else because the lack of a file system helps, and the sticky bit on an executable file meant "keep the program's text on swap even when it has stopped executing", so executing the program again would go much faster.) Then came paging, where only certain pages of a running program would get ejected to swap space.

    Unix systems would then both swap and page. Roughly, when memory pressure was low (but still high enough to demand swap space), the system would page. As memory pressure rose, the OS would decide the situation was untenable and select entire processes to be evicted to swap for a long time (several seconds to tens of seconds), then check periodically to see if they could/should be brought back (evicting someone else in the process). The BSDs even divided the task struct into two parts, a swappable and an unswappable part. The swappable part recorded things like page tables, etc., which is superfluous information once all the pages of a process have been ejected; the unswappable part contained only the bare minimum needed to remember there was a process on swap, and to make scheduling decisions regarding it. This made sense when main memory was measured in single-digit megabytes. I don't think Linux bothered with this (or even with swapping as a concept, implementing just paging - but don't quote me on that, as memories were getting bigger fast).

    Of course, swapping meant that those of us who ran X on a 4MB Sun system in the eighties would find that our xterm processes had been swapped out (the OS had decided that since they hadn't been used in a while, and were waiting for I/O, they were probably batch oriented in nature and could be swapped out wholesale), and it would take several seconds for the cursor to become responsive when you changed windows... :-) The scheduling decisions hadn't kept up. The solution, though, was the same as today: buy more memory... :-)

    Any good *old* book on OS internals, esp. the earlier incarnations of "The Design and Implementation of the FreeBSD Operating System" by McKusick et al., would have the gory details. (But the FreeBSD version of that book might have done away with that. It was still in the 4.2 version, though.) :-)

  • by petermgreen ( 876956 ) <plugwash@nOSpam.p10link.net> on Wednesday August 11, 2010 @09:50AM (#33214792) Homepage

    Memory is cheap
    Kind of. The memory itself isn't too expensive, but the cost of a system has a highly nonlinear relationship to memory capacity, at least with the Intel Nehalem stuff (it's been a while since I've looked at AMD so I can't really comment there).

    Up to 16GB you can use an ordinary LGA1366 board and CPU.

    To get to 24GB you need a LGA1366 board and CPU.

    To get to 48GB (or 72GB if you are prepared to take the performance hit and motherboard-choice hit that comes from putting three memory modules on a channel) you need a dual-socket LGA1366 board and associated dual-socket-capable CPUs (which are far, far more expensive clock for clock than their single-socket equivalents) and an associated special case.

    To get to 96GB (or 144GB if you are prepared to take the performance hit and motherboard choice hit that comes from putting three memory modules on a channel) you need the aforementioned dual-socket platform plus insanely expensive 8GB modules.

    Beyond that you are talking about moving to a quad-socket platform, AFAICT.

  • ESXi is free, though (Score:3, Informative)

    by the_doctor_23 ( 945852 ) on Wednesday August 11, 2010 @11:14AM (#33215732)
    True, ESX is not free, but ESXi (http://www.vmware.com/products/vsphere-hypervisor/index.html [vmware.com]) certainly is. If you have just one box and thus do not need stuff like HA or Vmotion, ESXi works just fine.
  • by petermgreen ( 876956 ) <plugwash@nOSpam.p10link.net> on Wednesday August 11, 2010 @11:18AM (#33215776) Homepage

    Up to 16GB you can use an ordinary LGA1366 board and CPU.
    That line should have said LGA1156

  • Re:Leaky Fawcet (Score:3, Informative)

    by sjames ( 1099 ) on Wednesday August 11, 2010 @12:39PM (#33216826) Homepage Journal

    You are confusing the one-time pageout of little-used (or unused but unfreed) pages with a constant state of thrashing. The former makes no noticeable difference to system performance, except that the system keeps NOT running out of memory; the latter is a horrific performance killer that happens any time the running processes actively demand more RAM than the machine has.

    An admin who routinely allows machines to thrash would indeed be bad. The solution to that is adding RAM or moving some services to another machine. Shrinking swap will not help; it will just trigger total failure sooner (but really, the machine has already functionally failed, since it is thrashing no matter how much or how little swap it has).

    An admin who doesn't provide enough swap to hold little-used or unused-but-still-allocated pages, out of some silly superstition that the server will suddenly start thrashing because he didn't use his version of the golden ratio (but would be just peachy otherwise), is a bad admin.
