I am working on a VM scheduling problem that has to account for efficient resource and energy utilisation, and I came across this paper. I understand RL and how Q-learning works, which is the approach the paper uses. However, I cannot get an intuitive understanding of the algorithm it proposes (page 3).

I understand that utilisation and power consumption are given equal importance but with opposite signs, so to speak. Step-3, however, is not intuitive to me. Can someone help me get a better understanding of the algorithm?