Cassandra high CPU load issues with 3.11.1

We have a 12-node Cassandra cluster with the following specs per node: 8 cores, 32 GB RAM, and a 16 GB heap with G1GC.

Java version: openjdk version "1.8.0_151"

All of a sudden we started seeing high CPU load (a load average of around 18-24 on 8-core nodes).

When I took a thread dump of the Cassandra process, it showed a lot of RUNNABLE threads like the ones below.

```
"MessagingService-Incoming-/10.xx.xx.xx"
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at org.apache.cassandra.io.util.NIODataInputStream.reBuffer(NIODataInputStream.java:66)
    at org.apache.cassandra.io.util.RebufferingInputStream.readByte(RebufferingInputStream.java:144)
    at org.apache.cassandra.io.util.RebufferingInputStream.readPrimitiveSlowly(RebufferingInputStream.java:108)
    at org.apache.cassandra.io.util.RebufferingInputStream.readInt(RebufferingInputStream.java:188)
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:179)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
```

and

"epollEventLoopGroup-2-9": running at io.netty.channel.epoll.Native.epollWait0(Native Method) at io.netty.channel.epoll.Native.epollWait(Native.java:117) at io.netty.channel.epoll.EpollEventLoop.epollWait(EpollEventLoop.java:226) at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:250) at io.netty.util.concurrent.SingleThreadEventExecutor$  2.run(SingleThreadEventExecutor.java:131) at io.netty.util.concurrent.DefaultThreadFactory$  DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) at java.lang.Thread.run(Thread.java:748) 

The first stack trace above shows up in 35 threads, and the second in 24 threads.
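In case it helps anyone reproduce the numbers: the occurrence counts were tallied from the thread dump by grouping RUNNABLE threads by their topmost stack frame. Below is a minimal sketch of that grouping (the class is my own helper, not part of Cassandra, and it assumes the standard HotSpot `jstack <pid>` output format):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts RUNNABLE threads in a jstack dump, grouped by their top stack frame. */
public class ThreadDumpSummary {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Map<String, Integer> countsByTopFrame = new TreeMap<>();

        boolean runnable = false; // current thread block is in state RUNNABLE
        boolean counted = false;  // already recorded this thread's top frame
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("\"")) {                        // thread header line
                runnable = false;
                counted = false;
            } else if (line.startsWith("java.lang.Thread.State:")) {
                runnable = line.contains("RUNNABLE");
            } else if (runnable && !counted && line.startsWith("at ")) {
                countsByTopFrame.merge(line.substring(3), 1, Integer::sum);
                counted = true;                                 // only count the top frame
            }
        }

        countsByTopFrame.forEach((frame, count) ->
                System.out.printf("%3d  %s%n", count, frame));
    }
}
```

Running it as `java ThreadDumpSummary dump.txt` prints one line per top frame, e.g. `35  sun.nio.ch.FileDispatcherImpl.read0(Native Method)`.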

Can anyone figure out what is wrong here?

From the cluster side:

  • There are no pending compactions/tasks (the sketch below shows one way to read that gauge over JMX).
  • GC pauses are below 100 ms.
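For the pending-compactions check, `nodetool compactionstats` reports the pending-task count; the sketch below reads the same gauge directly over JMX. The default JMX port 7199, no authentication, and the MBean name `org.apache.cassandra.metrics:type=Compaction,name=PendingTasks` with its `Value` attribute are my assumptions about the 3.11 metrics naming, so verify them (e.g. with jconsole) before relying on this.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/** Rough sketch: read Cassandra's pending-compactions gauge over JMX. */
public class PendingCompactions {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "127.0.0.1";

        // Assumes JMX is reachable on the default port 7199 without auth.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Assumed metric name for Cassandra 3.11; check with jconsole if unsure.
            ObjectName pendingTasks = new ObjectName(
                    "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks");
            Object value = mbs.getAttribute(pendingTasks, "Value");
            System.out.println(host + " pending compaction tasks: " + value);
        }
    }
}
```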

Thanks