I’m currently working on a multi-agent reinforcement learning problem. The setting is a cooperative multi-agent system in which the opponent policy is assumed to be stationary. Suppose I have a model of the opponent policy $\rho(\cdot|s_t)$, where $s_t$ denotes the environment state at time $t$. How can I actually make use of this model when designing the policy of our controllable agent, $\pi(\cdot|s_t)$?

Intuitively, I think the policy should take the form $\pi(\cdot|s_t,\rho(\cdot|s_t))$. However, I don’t know how to parameterize this expression, i.e., how to write it as a function $\pi(\cdot|s_t,\rho(\cdot|s_t)) = f(s_t,\rho(\cdot|s_t))$. Does anyone know of a sound or best-practice parameterization? Any thoughts would be appreciated.
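
To make the question concrete, here is a minimal sketch of the most naive option I can think of, not a recommended practice. It assumes both agents have discrete action spaces, so that $\rho(\cdot|s_t)$ is a finite probability vector that can be concatenated with the state features and passed through a small network with a softmax output. The names `make_policy`, `init_layer`, and `linear`, the layer sizes, and the random initialization are all hypothetical choices for illustration, and no training is shown.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in) and zero biases; illustrative init only."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(weights, biases, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def make_policy(state_dim, n_opp_actions, n_actions, hidden=16, seed=0):
    """Build f(s_t, rho(.|s_t)) as a one-hidden-layer network (assumed architecture)."""
    rng = random.Random(seed)
    w1, b1 = init_layer(state_dim + n_opp_actions, hidden, rng)
    w2, b2 = init_layer(hidden, n_actions, rng)

    def policy(state, opp_probs):
        # Concatenate state features with the opponent's action distribution.
        x = list(state) + list(opp_probs)
        h = [math.tanh(v) for v in linear(w1, b1, x)]
        # Softmax turns the output logits into a distribution over our actions.
        return softmax(linear(w2, b2, h))

    return policy


# Example call with a 3-dimensional state and an opponent with 2 actions.
pi = make_policy(state_dim=3, n_opp_actions=2, n_actions=4)
action_probs = pi([0.1, -0.4, 0.7], [0.8, 0.2])
```

Here `action_probs` plays the role of $\pi(\cdot|s_t,\rho(\cdot|s_t))$ for one state. Whether plain concatenation like this is a sound choice is exactly what I am unsure about.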