Tim: Bill, we know the TACC among other things is home to some high performance clusters. And some of these have actually been on the top 500 list. Can we talk about the newest of these, which is Stampede, which is the system you are standing in front of?
Bill: Sure, so the system behind me is Stampede. It is primarily a Dell, Intel, and Mellanox system. It is 6,400 compute nodes plus a variety of additional servers and administrative support nodes, I/O nodes, etc. Each node is 2 Intel Xeon Sandy Bridge sockets with 16 cores total, and then every node has at least one Intel Xeon Phi card in it. Some of the nodes, 480 of them, have an additional Intel Xeon Phi, so we are working on dual Phi, and then 128 of the nodes have one Xeon Phi and one Nvidia Kepler K20 as well. And so we’ve got a lot of different accelerator models that can be tried out in the system. And the system sits, based on all of those accelerators, at number six on the top 500 as of the list that came out at ISC in January.
Tim: What do you mean by accelerator model?
Bill: So every node in Stampede has a pair of regular Xeon processors on it. It has also got this Xeon Phi card or Kepler card in it where additional processing can be offloaded to the card. So each card is about a teraflop and each node by itself is about 300 gigaflops so there is an acceleration possible there based on the extra resources that are available. But they are not required. So it is sort of a coprocessing model. If you want to use it, it is there. If you have a code that is appropriate for it, it is there. If you have written the additional things that have to happen, you can offload your extra flops to the card itself.
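To put rough numbers on that, the sketch below simply adds up the figures quoted above: about 300 gigaflops for the two Xeons in a node and about a teraflop per card. The function name and the rounding are illustrative assumptions, and these are quoted peak rates, not measured application performance.

```python
# Round figures quoted in the interview; peak rates, not measured performance.
NODE_GFLOPS = 300.0    # two Sandy Bridge sockets, "about 300 gigaflops"
CARD_GFLOPS = 1000.0   # one accelerator card, "about a teraflop"


def node_peak_gflops(cards):
    """Peak rate of one node carrying the given number of accelerator cards."""
    return NODE_GFLOPS + cards * CARD_GFLOPS


for cards in (0, 1, 2):
    peak = node_peak_gflops(cards)
    ratio = peak / NODE_GFLOPS
    print(f"{cards} card(s): {peak:.0f} GF peak, {ratio:.1f}x the host alone")
```

On these round numbers a single card raises a node's peak to a bit more than four times what the host processors offer alone, which is the headroom a code can reach only if it actually offloads work.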
Tim: This must be one of the biggest installations of that sort.
Bill: It was the biggest until the Chinese put one out – it is a bit bigger system in terms of compute nodes, it is number one on the top 500 list. And it has got 3 Intel Xeon Phi per node instead of ours – practically one.
Tim: That just came out last month, the announcement.
Bill: It was just announced in June at the International Supercomputing Conference in Leipzig in Germany. It was kind of a dark horse thing. We kind of knew that something was coming in Asia, but we didn’t know how big it would be. The Chinese have been doing this recently: Tianhe-1A, the Nebulae system before that, and so now Tianhe-2, just a big accelerator play from the Chinese.
Female Speaker: You want to clarify that Stampede was the first large scale installation of the Xeon Phi?
Bill: I think that is really a good point. So Stampede was the first large scale deployment of Intel Xeon Phi with our 6,800 cards roughly, and now we have got this Chinese system out there as well.
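The "6,800 cards roughly" figure can be checked against the node counts given earlier in the interview. This is a small sketch that assumes the 128 GPU nodes carry only the baseline Xeon Phi next to their K20; the variable names are made up for illustration.

```python
COMPUTE_NODES = 6400   # every node has at least one Xeon Phi
DUAL_PHI_NODES = 480   # nodes with a second Xeon Phi
GPU_NODES = 128        # nodes with one Xeon Phi plus one Nvidia K20 (assumed)

phi_cards = COMPUTE_NODES + DUAL_PHI_NODES
all_accelerators = phi_cards + GPU_NODES

print(f"Xeon Phi cards: {phi_cards}")
print(f"Accelerators including K20s: {all_accelerators}")
```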
Tim: Now with the model of having accelerators built into each of these units, does the software have to be written for that?
Bill: Somewhat, yeah. So the Xeon Phi is nice because it is basically an x86 instruction set. It looks like a Linux node. It runs Linux, it runs ____3:18, you can log in to the card with SSH too, it sits on a fake TCP/IP network. You can just log in, compile your code for the Phi with a cross-compiling switch for the compiler, and then you can take your code over there and run it. But to really get performance you’ve got to do some extra work.
There is some code restructuring; you may have to add some OpenMP to your code where you don’t have it now. There is vectorization to take care of, things like that. So you’ve got to be more careful with the way you program, maybe do some extra work to make some things perform better on the card. But most of the codes that we have port from Xeon to Xeon Phi in minutes to hours. Getting performance, though, that is a real process. It is similar to the kind of work you do for a GPU, but you write it in C or Fortran instead of having to go to something like CUDA.
Tim: With that kind of optimization, there can’t be a huge pool of people who know how to do it. Are people like you here at TACC the layer where that optimization is done, or are there tools out there for researchers who submit projects, or do they already know how to optimize?
Bill: Yes to all of those questions, in fact. So we have a very diverse user community. Some people are really quite savvy and have already done some GPU work, so the kind of work that they need to do to restructure their codes is similar if not the same. Those people know exactly what they are doing, they get started, they hit the ground running, and they are already producing results that have been presented at conferences. Some other people will need to come to a training, and we’ll teach them how to do it; some people may apply for some sort of advanced consulting where we will help them work on their code.
So we have a full range of options for people. Some people will be complete neophytes. This would be their first time on a cluster ever, and they will come to the system, and they may not even use the Phi. They may just jump on these great Sandy Bridge processors that we’ve got and get going there first, and then eventually, when they want some more performance, start investigating how to use the Phi. It is mostly a training and documentation model with some advanced consulting, but we have thousands of users, so we don’t really have enough people to help everyone personally.
Tim: You touched a little bit on the operating environment, so you can SSH into a node even if it’s basically a virtual address that it is living on.
Tim: I would like you to touch on that a little bit more. For an end user, how much of the operating environment is off-the-shelf tools and free software that is out there, like Linux tools, and how much of it has been developed here at TACC?
Bill: So 99% of it is off the shelf. Some of it is commercial. We use the Intel compiler; GNU compilers are available too. But for the best optimization for Sandy Bridge and for Phi we really need the work that Intel put into it. But most of the rest of the tools are off the shelf. It is a CentOS-based environment. People log in to a login node, they do their compiling, they move their data around, they set up their jobs, and then they submit a BASH or C shell script to a scheduling system called SLURM, and that system takes in all of these job requests and schedules them.
There is a queuing system and then that job goes off and runs on whatever compute nodes they asked for. And that is also a complete CentOS environment. They can also get an interactive shell on compute nodes, and so they can interact directly with these parallel jobs that they are running if they want to. We try to ask people not to run on the login nodes because there may be dozens or hundreds of people logged in at the same time, trying to use a beefy but shared resource. So we don’t want them running their chemistry application right in that login environment.
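As a rough illustration of the batch script described here, the sketch below builds the text of a small SLURM job script. The `#SBATCH` options shown (`-J`, `-N`, `-n`, `-t`, `-p`, `-o`) and the `%j` job-ID pattern are standard SLURM, but the job name, the partition name `normal`, and the executable `./my_app` are placeholders, and a given site may prefer its own launcher over `srun`.

```python
def make_batch_script(job_name, nodes, tasks, walltime, partition, command):
    """Return the text of a minimal SLURM batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH -J {job_name}",          # job name
        f"#SBATCH -N {nodes}",             # number of nodes
        f"#SBATCH -n {tasks}",             # total number of tasks
        f"#SBATCH -t {walltime}",          # wall-clock limit, hh:mm:ss
        f"#SBATCH -p {partition}",         # partition (queue); site specific
        f"#SBATCH -o {job_name}.%j.out",   # output file; %j expands to the job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: 2 nodes with 16 tasks each, one hour, placeholder names.
script = make_batch_script(
    job_name="demo",
    nodes=2,
    tasks=32,
    walltime="01:00:00",
    partition="normal",
    command="srun ./my_app",
)
print(script)
```

Saved to a file, a script like this would be handed to the scheduler with `sbatch`, and the job then waits in the queue until the requested nodes are free.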
Tim: How many people are typically logged in at a time?
Bill: A few, 100, 120, 150, 200, somewhere in that range. It varies a lot. We are mostly an academic environment, so between semesters, nationwide, we lose people. From May to early June we lose people, and then they get a Christmas break. It is much more heavily used when schools are in session, and during the summer when grad students and postdocs are working really hard and not teaching classes or taking classes. So the number of people logged in goes up and down.
Tim: Do you use that downtime to do things like training?
Bill: No, we do training basically all the time. We can sequester off parts of the system so that people who have come for a training class can get some dedicated access to that part of the system. There are hundreds of jobs running at a time, and hundreds of users submitting jobs, so we can pull back a little piece, 20 or 30 nodes, and dedicate that to a training class, and we just put it back in the mix when the training is over.
Tim: There are so many big clusters like this around the world right now. The environment is not a commodity, but for users at Stanford or other schools who want to submit a job, is it getting normal to submit jobs to big clusters like this, or does it have to be done separately, differently, for Stampede?
Bill: They are always a little different. But I think anybody that is familiar with a Linux command line will be very comfortable. There are probably six or eight different schedulers that you can either buy or download for free and use. And so there is a little bit of a learning curve. If you’ve used PBS before and now you are using SLURM, well, okay, you’ve got to learn the commands there and the directives and stuff. But it is mostly submitting a shell script, so if you know BASH or C shell you are looking at the same thing. It is a little bit of a different environment in terms of how you find the tools, how you link the libraries, and how you find where those libraries are located, but we use similar environments as other people. So people get comfortable with it, and you make some minor tweaks and go on with your process.
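The PBS-to-SLURM learning curve mentioned here is mostly a matter of mapping commands and directives. The sketch below covers only a few common cases, and the helper name `translate_directive` is made up for illustration; real scripts use many more options than this handles.

```python
# Everyday PBS commands and their SLURM counterparts.
PBS_TO_SLURM_COMMANDS = {
    "qsub": "sbatch",    # submit a job script
    "qstat": "squeue",   # list queued and running jobs
    "qdel": "scancel",   # cancel a job
}


def translate_directive(line):
    """Translate a few simple '#PBS' directive lines to '#SBATCH' form.

    Lines that are not recognized are returned unchanged.
    """
    parts = line.split()
    if len(parts) != 3 or parts[0] != "#PBS":
        return line
    flag, value = parts[1], parts[2]
    if flag == "-N":                                  # job name
        return f"#SBATCH -J {value}"
    if flag == "-q":                                  # queue -> partition
        return f"#SBATCH -p {value}"
    if flag == "-l" and value.startswith("walltime="):
        return "#SBATCH -t " + value[len("walltime="):]
    return line


for pbs_line in ("#PBS -N myjob", "#PBS -q normal", "#PBS -l walltime=02:00:00"):
    print(pbs_line, "->", translate_directive(pbs_line))
```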
Tim: You are using a pretty off-the-shelf Linux distro, CentOS. Why CentOS?
Bill: I think there are two major reasons. One is that some of the HPC libraries and tools that we use are kernel sensitive. Things like Lustre and the Xeon Phi software stack and the InfiniBand stack are sensitive to, and only supported on, certain kernel versions. The Red Hat Enterprise Linux/CentOS family is one of the supported families where all of those things work on a particular kernel, and so being able to get to that kernel, and being able to use that kernel, is important to be able to support Lustre and the Phi and the IB stack.
And so we kind of have to. And SUSE would probably be another choice that we would be able to deal with. The other thing that we like about CentOS, being a completely free version of the Red Hat environment, is that we are RPM based as well. So every third party package that we build on behalf of our users, we build as an RPM and then use the RPM process to install. And so everything is based on RPM for us, and so being Red Hat-ish was good for us.
Tim: I want to ask you about the interaction between you here at TACC and the users, who could be anywhere.
Tim: Is there an approval process for their software?
Tim: There’s not?
Bill: And so yes, that was an interesting question when you sent it to me. The users get their project ideas vetted by a national peer review body, and then they get allocated basically some funny money time – they don’t pay for it, but we allocate in terms of core hours. So they might get a million core hours for a year. And then they come, they SSH in to the login nodes, they can build whatever they want. If they do something that we detect as abusive, then we get in touch with them and say, “You are beating up on the login node, you are crashing the file system, you are running a Bitcoin miner, you are running an IRC chat bot, and you shouldn’t be doing that,” any of that kind of stuff.
We have had that kind of stuff in the past, and we try to say, “Hey, we detected this, that’s beyond our policies, please stop.” But they can run basically whatever they want. If they are not getting the kind of performance they want, or things aren’t going well for them, they can get in touch with us, and we will help them figure out how to port their code and how to improve their performance. Like I said before, we’ve got a huge user community and a relatively small number of people, about 30, who interact with our user community on a regular basis. It is pretty much a free-for-all, but they know that it is a precious resource and they shouldn’t abuse their time.
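To give a feel for what a million core hours buys, here is a small budgeting sketch. It assumes jobs are charged for every core on the nodes they occupy (16 per node, per the description of the hardware above); the actual charging policy is not stated in the interview, and the 64-node, 24-hour job is just an example.

```python
CORES_PER_NODE = 16  # two Sandy Bridge sockets per Stampede node


def job_cost_core_hours(nodes, wall_hours):
    """Core hours charged for one job, assuming whole-node charging."""
    return nodes * CORES_PER_NODE * wall_hours


allocation = 1_000_000                       # "a million core hours for a year"
cost = job_cost_core_hours(nodes=64, wall_hours=24)
jobs_that_fit = allocation // cost           # whole jobs that fit in the allocation
node_hours = allocation / CORES_PER_NODE     # the same allocation in node hours

print(f"One 64-node, 24-hour job costs {cost} core hours")
print(f"The allocation covers {jobs_that_fit} such jobs ({node_hours:.0f} node hours in total)")
```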
Tim: Well, since Bitcoin mining is out, can I ask you to talk about a few of the ____11:56 projects that you know that run on this?
Bill: Sure. Most of the science that happens on Stampede is open science that people want to publish. It is basically required, in order to get time, that you are going to go out and publish your result. So we have a lot of chemistry and earthquake modeling. On a previous system, we had people from Japan come and run models of the Japanese earthquake right after it happened, and we shared some time with them because their supercomputers were down, because their electricity was off.
We support weather modeling, you name it, we are finding people that want to do it, a lot of biology, a lot of big data sort of stuff starting to happen, so it really covers the full gamut. We have 800 or 1000 active projects on the system since January, so if you can think of an area of computing there is probably somebody doing it.
Tim: If they are willing to publish, are commercial projects
Bill: Absolutely. In fact, we have a handful of commercial partners that get access to the system, and work with us, and then publish their results. So we find that oil companies don’t bring their super-secret data set, but they might bring their algorithm and a synthetic data set that is like a real reservoir, but it is not really a reservoir, and then they will do some experiments. They have their own big systems mostly. But sometimes they want to work with an academic community to get some fresh ideas and do things like that.
Tim: A lot of government funded projects as well?
Bill: Yeah, the vast majority of the projects that run on the system are funded by the National Science Foundation. It is not a requirement, even though the National Science Foundation runs the allocation process and paid for the system. Anyone doing open science research in the US can be a principal investigator on one of these projects and apply for time. And then any of their project personnel can run. So those people might be international people in Europe, in Asia, in Africa, in South America, all working with and collaborating with people in the US on allocations of time. And most of those people are NSF funded, some NIH, some DOD, some DOE, but mostly ____13:57.
Tim: Now speaking of cooperating with commercial entities, this is entirely a system with Dell-made machines in here, and Dell’s just up the road. Does that help at all?
Bill: Yeah, actually it does. We do deal with them in person. My colleagues are having a meeting with Dell right now; they meet sort of biweekly to talk about Stampede. We don’t partner with Dell just because they are here, and just because they are in our backyard, but we do have a good relationship with them. If we have a problem we can get somebody here anytime we need them. And that has been a really nice relationship for us to have and a strategic partnership for us. The system we just turned off was built by Sun, and Oracle helped run it in the last year of the project. No vendor is special. We are always looking for the best system at the right time at the right price. This one, we are happy to have Dell helping us run, and having them nearby is always helpful.
Tim: You don’t seem like a very panicky person. How much do you have to actually watch nuclear control?
Bill: You know, we don’t have a big control room with 55 displays and blinking lights and fancy stuff.
Tim: You got plenty of blinking lights.
Bill: We’ve got plenty of these kind of blinking lights, but it is not the war room or anything like that. It is mostly some log script watching, a lot of automated stuff; we’ve got Nagios and some home-built tools watching certain things. If the network goes down, Nagios will let us know and people get paged. If the room literally catches on fire, we have some people here to help direct the fire department hit
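For readers who have not seen one, a Nagios check is just a program that reports a status line and one of four standard exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below is a hypothetical check on the number of unreachable nodes; the function name and thresholds are invented for illustration and are not TACC's actual tooling.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_nodes_down(nodes_down, warn_at, crit_at):
    """Return (exit_code, status_line) for a count of unreachable nodes."""
    if nodes_down is None or nodes_down < 0:
        return UNKNOWN, "UNKNOWN - could not determine node count"
    if nodes_down >= crit_at:
        return CRITICAL, f"CRITICAL - {nodes_down} nodes down"
    if nodes_down >= warn_at:
        return WARNING, f"WARNING - {nodes_down} nodes down"
    return OK, f"OK - {nodes_down} nodes down"


# Hypothetical reading: 12 nodes unreachable, warn at 5, page someone at 50.
code, message = check_nodes_down(nodes_down=12, warn_at=5, crit_at=50)
print(message)
```

A real plugin would print the status line and then exit with `code`, which is what lets the monitoring server decide whether anyone needs to be paged.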
Tim: How big is the core group that would be able to respond?
Bill: So the sysadmin group, our advanced computing systems group is about a dozen people. One dedicated sysadmin whose only job is to work on Stampede, and then a lot of people who have roles that are sliced across several systems. So we might have a network engineer who works on lots of systems, we might have a hardware engineer who is responsible for swapping drives and repairing hardware and other things that is spread across several systems. So it is probably four full time sysadmins to keep track of the system, but that is sliced across a dozen people, with one dedicated person who sort of owns the system and it’s theirs.
Tim: That seems like an amazingly small group for a system that takes up so much space.
Bill: Yeah, it is interesting. Going from one server to, say, 100 requires some scale-up in personnel, but once you hit that level, it is just adding one more node. If we had 50 times this much, we would need more hardware people just to handle all the issues, but the software environment is such that it is relatively straightforward to have the scripts and tools that you need to automate a lot of it.
Tim: This job is pretty fun for you, isn’t it?
Bill: I like it.