Networking Data Storage Supercomputing

Ceph: a Journey To 1 TiB/s (ceph.io)

It's "a free and open-source, software-defined storage platform," according to Wikipedia, providing object storage, block storage, and file storage "built on a common distributed cluster foundation". The charter advisory board for Ceph included people from Canonical, CERN, Cisco, Fujitsu, Intel, Red Hat, SanDisk, and SUSE.

And Nite_Hawk (Slashdot reader #1,304) is one of its core engineers — a former Red Hat principal software engineer named Mark Nelson. (He's now leading R&D for a small cloud systems company called Clyso that provides Ceph consulting.) He's returned to Slashdot to share a blog post describing "a journey to 1 TiB/s". This gnarly tale from production starts with Clyso assisting "a fairly hip and cutting edge company that wanted to transition their HDD-backed Ceph cluster to a 10 petabyte NVMe deployment" using object-based storage devices [or OSDs]:

I can't believe they figured it out first. That was the thought going through my head back in mid-December after several weeks of 12-hour days debugging why this cluster was slow... Half-forgotten superstitions from the 90s about appeasing SCSI gods flitted through my consciousness...

Ultimately they decided to go with a Dell architecture we designed, which was quoted at roughly 13% cheaper than the original configuration despite having several key advantages. The new configuration has less memory per OSD (still a comfortable 12GiB each), but faster memory throughput. It also provides more aggregate CPU resources, significantly more aggregate network throughput, a simpler single-socket configuration, and utilizes the newest generation of AMD processors and DDR5 RAM. By employing smaller nodes, we halved the impact of a node failure on cluster recovery....

The initial single-OSD test looked fantastic for large reads and writes and showed nearly the same throughput we saw when running FIO tests directly against the drives. As soon as we ran the 8-OSD test, however, we observed a performance drop. Subsequent single-OSD tests continued to perform poorly until several hours later when they recovered. So long as a multi-OSD test was not introduced, performance remained high. Confusingly, we were unable to invoke the same behavior when running FIO tests directly against the drives. Just as confusing, we saw that during the 8 OSD test, a single OSD would use significantly more CPU than the others. A wallclock profile of the OSD under load showed significant time spent in io_submit, which is what we typically see when the kernel starts blocking because a drive's queue becomes full...
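
The drive-level baseline described above can be reproduced, in spirit, with fio. Below is a minimal sketch, assuming hypothetical device paths and a 4M sequential-read workload (not the exact job files from the post), that runs fio against each drive and flags obvious throughput outliers:

    import json
    import subprocess

    # Hypothetical device list -- adjust to the drives actually under test.
    # Running fio against raw block devices needs root; this is a read-only workload.
    DEVICES = ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"]

    def fio_read_bw(device):
        """Run a short large-block sequential read with fio and return MiB/s."""
        cmd = [
            "fio", "--name=baseline", f"--filename={device}",
            "--rw=read", "--bs=4M", "--iodepth=16", "--ioengine=libaio",
            "--direct=1", "--runtime=30", "--time_based",
            "--output-format=json",
        ]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        job = json.loads(out.stdout)["jobs"][0]
        return job["read"]["bw"] / 1024.0  # fio reports bandwidth in KiB/s

    if __name__ == "__main__":
        results = {dev: fio_read_bw(dev) for dev in DEVICES}
        best = max(results.values())
        for dev, bw in results.items():
            flag = "  <-- slow outlier?" if bw < 0.8 * best else ""
            print(f"{dev}: {bw:8.1f} MiB/s{flag}")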

For over a week, we looked at everything from BIOS settings, NVMe multipath, low-level NVMe debugging, changing kernel/Ubuntu versions, and checking every single kernel, OS, and Ceph setting we could think of. None of these things fully resolved the issue. We even performed blktrace and iowatcher analysis during "good" and "bad" single-OSD tests, and could directly observe the slow IO completion behavior. At this point, we started getting the hardware vendors involved. Ultimately it turned out to be unnecessary. There were one minor and two major fixes that got things back on track.

It's a long blog post, but here's where it ends up:
  • Fix One: "Ceph is incredibly sensitive to latency introduced by CPU c-state transitions. A quick check of the bios on these nodes showed that they weren't running in maximum performance mode which disables c-states."
  • Fix Two: [A very clever engineer working for the customer] "ran a perf profile during a bad run and made a very astute discovery: A huge amount of time is spent in the kernel contending on a spin lock while updating the IOMMU mappings. He disabled IOMMU in the kernel and immediately saw a huge increase in performance during the 8-node tests." In a comment below, Nelson adds that "We've never seen the IOMMU issue before with Ceph... I'm hoping we can work with the vendors to understand better what's going on and get it fixed without having to completely disable IOMMU."
  • Fix Three: "We were not, in fact, building RocksDB with the correct compile flags... It turns out that Canonical fixed this for their own builds as did Gentoo after seeing the note I wrote in do_cmake.sh over 6 years ago... With the issue understood, we built custom 17.2.7 packages with a fix in place. Compaction time dropped by around 3X and 4K random write performance doubled."
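
For readers curious about the first two fixes, here's a minimal, hedged sketch (not from the blog post) for checking the same two conditions on a Linux node: it lists which CPU idle c-states are still enabled via the cpuidle sysfs tree and greps /proc/cmdline for IOMMU-related options. Paths and interpretation assume a typical Linux install; adjust for your distribution.

    import glob
    from pathlib import Path

    def enabled_cstates():
        """List idle states still enabled on cpu0 (deep states add wakeup latency)."""
        states = []
        for state_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*")):
            name = (Path(state_dir) / "name").read_text().strip()
            disabled = (Path(state_dir) / "disable").read_text().strip() == "1"
            if not disabled:
                states.append(name)
        return states

    def iommu_cmdline_flags():
        """Return IOMMU-related kernel command-line options (e.g. amd_iommu=off, iommu=pt)."""
        return [opt for opt in Path("/proc/cmdline").read_text().split() if "iommu" in opt]

    if __name__ == "__main__":
        print("Enabled c-states on cpu0:", enabled_cstates())
        print("IOMMU kernel options:", iommu_cmdline_flags() or "none (kernel default)")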

The story has a happy ending, with performance testing eventually showing data being read at 635 GiB/s — and a colleague daring them to attempt 1 TiB/s. They built a new testing configuration targeting 63 nodes — achieving 950 GiB/s — then tried some more performance optimizations...
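
For a rough sense of scale (a back-of-envelope calculation, not a figure from the post), spreading those aggregate numbers across the 63-node test configuration implies each node sustaining on the order of 15-16 GiB/s of reads:

    # Back-of-envelope per-node read throughput for the figures quoted above.
    nodes = 63
    target_gib_s = 1024.0     # 1 TiB/s goal expressed in GiB/s
    achieved_gib_s = 950.0    # measured aggregate before further tuning

    print(f"target:   {target_gib_s / nodes:.1f} GiB/s per node")    # ~16.3
    print(f"achieved: {achieved_gib_s / nodes:.1f} GiB/s per node")  # ~15.1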



Comments:
  • by Shag ( 3737 ) on Saturday January 20, 2024 @12:30PM (#64175169) Journal

    The other faster-than-RAID solution I'm familiar with, MIT's Mark 6 VLBI data system, only does something like 16 Gbps, I think.

    • If you just care about the S3 protocol, I've gotten some impressive numbers from a load balancer and MinIO hosts. You make sure the machines are near-identically configured, format the drives used with MinIO with XFS, and let MinIO handle the RAID heavy lifting between drives and between hosts. This means a fast network fabric, so 20GigE, 40GigE or even 100GigE is needed. However, what this gives is not just fast object reads/writes, but object locking, which, assuming the MinIO machines are firewalled...
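
      As a hedged illustration of the object-locking feature mentioned above (using the MinIO Python SDK; the endpoint, credentials, and bucket name are placeholders, not anything from the comment), locking has to be enabled when the bucket is created:

        import io
        from minio import Minio

        # Hypothetical load-balancer endpoint fronting the MinIO hosts.
        client = Minio("s3.example.internal:9000",
                       access_key="ACCESS_KEY",
                       secret_key="SECRET_KEY",
                       secure=True)

        # Object locking must be enabled at bucket-creation time.
        if not client.bucket_exists("immutable-backups"):
            client.make_bucket("immutable-backups", object_lock=True)

        # Ordinary put; the bucket's lock/retention policy then governs the object.
        payload = b"example object body"
        client.put_object("immutable-backups", "backup-0001.tar",
                          io.BytesIO(payload), length=len(payload))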

  • Some updates (Score:5, Informative)

    by Nite_Hawk ( 1304 ) on Saturday January 20, 2024 @12:33PM (#64175177) Homepage

    Just to clarify: I actually don't work at Red Hat anymore. I'm leading R&D for a small company called Clyso that provides Ceph consulting and support along with my good friends Joachim Kraftmayer (Founder) and Dan van der Ster (Former architect of the Ceph storage at CERN).

    One note I wanted to add since there has been some discussion about it on other forums:

    We've never seen the IOMMU issue before with Ceph. We have previous generation 1U Dell servers with older EPYC processors in the upstream Ceph performance lab and they've never shown the same symptoms. The customer did however tell us that they have seen issues with IOMMU before in other contexts in their data center. I'm hoping we can work with the vendors to understand better what's going on and get it fixed without having to completely disable IOMMU.

    • Thanks for sharing the story! (I just added the information from your comment into Slashdot's post...)
    • Would be interesting to find out what the issue is and whether it impacts both Intel and AMD processors. We tested with AMD processors in Dell's lab, which they make available to customers, and got less performance per dollar compared to Intel; our deployments are hyperconverged, so IOMMU is practically a requirement.

    • by MinusOne ( 4145 )

      This is one of the very few times since I joined that I've seen a user ID lower than mine. We've both been here for a LONG time!

  • Many Thanks! (Score:4, Insightful)

    by flatulus ( 260854 ) on Saturday January 20, 2024 @01:44PM (#64175229)
    Thank you for posting this detailed article. This is the stuff I read Slashdot for. You aren't getting many comments (heady stuff here, and no politics), so I wanted to tick your counter +1 at least.

    I will never get close to the knowledge and experience you demonstrate here, but I am fascinated all the same. I have tinkered with Ceph just a little in my home lab, using ProxMox, and was mightily impressed. To hear what you and your client have accomplished is thrilling! Oh, and kudos for Dell and AMD - I built a small datacenter on my last job (just before I retired) and took a few jeers for choosing AMD. But the Dell servers and EPYC processors have performed admirably, and from what I hear, continue to do so today.
    • by Nite_Hawk ( 1304 )

      Thank you for the kind words! I can't believe I've been working on Ceph for 12 years now. In reality though, this was very much a collaborative effort with the customer. We couldn't have pulled this off without the network they designed, and they were the ones who ultimately figured out the IOMMU issue.

      Good job for being ahead of the curve on AMD! They're really giving Intel a run for their money, especially if you favor single-socket configurations.

      • Ceph is only going to become more of a mainstream item, especially with (IMHO) Broadcom putting the squeeze on VMware customers, which means a lot of companies will be looking at Proxmox, and its main answer to what VMFS does is Ceph. I was surprised that Ceph can work over block connections like iSCSI or Fibre Channel, as I thought it was a successor to AFS/DFS, so with Ceph in place and a solid SAN with multipathing and multiple controllers, this can provide Proxmox an enterprise-grade foundation...

  • For those of us who normally acronym-skip but then hit the buffers when one of them is critical to unlocking the story, here's IOMMU: https://en.wikipedia.org/wiki/Input–output_memory_management_unit
  • A few years ago I began looking at cluster storage options and was introduced to Ceph as well as GlusterFS. Looking around I found a distribution called PetaSAN that used Ceph. It was quick and easy to stand up a 5-node, 60-OSD cluster for testing in my lab with iSCSI, NFS and SMB.
  • This might very well be the most interesting post I've seen all year. Hearing about the progression of this and how you worked with customers and vendors on such a technical issue was such a relief compared to pretty much all the other "stuff" they post here.

    Thank you for sharing this!

  • Figures like that are fairly easy to achieve depending on the hardware that you throw at it. To put it into context, GPFS is benchmarked at over 2.5 TiB/s and I am sure it could go higher, so the performance of 1 TiB/s is hardly impressive.

    Generally, what is more important is things like management, backup, etc. rather than headline benchmark numbers, because in real life you're never going to get anywhere near them, based on my 15 years of experience in HPC.

"Being against torture ought to be sort of a multipartisan thing." -- Karl Lehenbauer, as amended by Jeff Daiell, a Libertarian

Working...