Compaction error: Corruption: block checksum mismatch: expected 862584094, got 1969278739

Currently trying to handle this right in the middle of my 3-day Memorial Day weekend 😀

  • Ceph 13.2.4 (Filestore)
  • Rook 0.9
  • Kubernetes 1.14.1

https://gist.github.com/sfxworks/ce77473a93b96570af319120e74535ec

My setup is a Kubernetes cluster with Rook handling Ceph. On 13.2.4, one of my OSDs keeps restarting. This started recently; no power failure or anything of the sort occurred on the node.
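For reference, this is roughly how I'm watching the restart loop (assuming the default rook-ceph namespace and the standard app=rook-ceph-osd label; the pod name is a placeholder):

  # list the OSD pods with their restart counts
  kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide

  # pull logs from the previous (crashed) run of the flapping OSD pod
  kubectl -n rook-ceph logs <rook-ceph-osd-pod> --previous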

2019-05-25 01:06:07.192 7fb923359700  3 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.4/rpm/el7/BUILD/ceph-13.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1929] Compaction error: Corruption: block checksum mismatch: expected 862584094, got 1969278739  in /var/lib/rook/osd1/current/omap/002408.sst offset 15647059 size 3855 

There are a few more entries in the gist with a similar error message. The only other distinct one states:

2019-05-25 01:06:07.192 7fb939a4a1c0  0 filestore(/var/lib/rook/osd1) EPERM suggests file(s) in osd data dir not owned by ceph user, or leveldb corruption 

I checked this on the node: everything in the OSD data dir is owned by root, the same as on the other OSDs. It’s also containerized, and deleting the pod so the operator recreates it did not help.
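For what it's worth, the ownership check was just something like this against the OSD data dir (it comes back empty, i.e. everything is root):

  # anything under the data dir not owned by root
  find /var/lib/rook/osd1 ! -user root -print

  # eyeball the omap store files themselves
  ls -ln /var/lib/rook/osd1/current/omap | head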

The only thing I was able to find that looks related is https://tracker.ceph.com/issues/21303, but that seems to be about a year old. I am not sure where to begin with this. Any leads, pointers to documentation to follow, or a solution if you have one would be a great help. I see some tools for BlueStore, but I do not know how applicable they are to a Filestore OSD, and I want to be very careful given the situation.
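So far the only idea I've had, and please tell me if it is a bad one, is to stop the OSD, copy the omap directory aside, and do a read-only walk of the keys with ceph-kvstore-tool to see if it hits the same checksum error. This assumes the tool shipped with 13.2.4 works against this Filestore omap dir (it looks like rocksdb given the log); I have not run it yet:

  # with the OSD stopped, keep an untouched copy of the omap store first
  cp -a /var/lib/rook/osd1/current/omap /var/lib/rook/osd1/current/omap.bak

  # read-only walk over every key; should reproduce the corruption if it's in the kv store
  ceph-kvstore-tool rocksdb /var/lib/rook/osd1/current/omap list-crc > /dev/null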

In the worst-case scenario, I have backups. I’m willing to try things within reason.
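If it comes to that, my rough plan before reaching for the backups would be to let the cluster heal from replicas and rebuild just this OSD, assuming osd.1 is the broken one and that Rook will re-provision the freed disk. Something like this from the toolbox pod; please tell me if that is the wrong approach:

  # take the bad OSD out and let data backfill from the replicas
  ceph osd out osd.1

  # watch recovery, then confirm the OSD is no longer needed
  ceph -s
  ceph osd safe-to-destroy osd.1

  # remove it from the cluster so it can be recreated clean
  ceph osd purge 1 --yes-i-really-mean-it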