# How to make use of the opponent's policy when designing ours in multi-agent learning

I’m currently working on a multi-agent reinforcement learning problem. The setting is a cooperative multi-agent system with a stationarity assumption on the opponent’s policy. Suppose I have a model of the opponent policy $$\rho(\cdot|s_t)$$, where $$s_t$$ denotes the environment state at time $$t$$. I’m wondering how to actually make use of it when designing our controllable agent’s policy $$\pi(\cdot|s_t)$$.

Intuitively, I think the policy should take the form $$\pi(\cdot|s_t,\rho(\cdot|s_t))$$. Yet I don’t have any idea how to parameterize this expression, i.e., how to turn it into a concrete functional form $$\pi(\cdot|s_t,\rho(\cdot|s_t)) = f(s_t,\rho(\cdot|s_t))$$. Does anyone have a clue about a sound/best-practice parameterization? Any thoughts would be appreciated.
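
To make this more concrete, the naive idea I’ve been toying with is to evaluate the (stationary) opponent model at the current state and feed its action probabilities into the policy network as extra input features, i.e., realize $$f$$ by simple concatenation. Below is a minimal sketch assuming a discrete action space and PyTorch; `OpponentConditionedPolicy` and all the dimension names are placeholders of mine, not an established API.

```python
import torch
import torch.nn as nn


class OpponentConditionedPolicy(nn.Module):
    """pi(.|s_t, rho(.|s_t)) = f(s_t, rho(.|s_t)) via simple concatenation."""

    def __init__(self, state_dim, opp_action_dim, action_dim, hidden_dim=64):
        super().__init__()
        # The opponent's action probabilities are treated as extra
        # input features alongside the state.
        self.net = nn.Sequential(
            nn.Linear(state_dim + opp_action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, opp_probs):
        # state: (batch, state_dim); opp_probs: (batch, opp_action_dim),
        # i.e. rho(.|s_t) evaluated by the fixed opponent model.
        x = torch.cat([state, opp_probs], dim=-1)
        return torch.distributions.Categorical(logits=self.net(x))


# Hypothetical usage; `opp_probs` stands in for the stationary rho(.|s_t).
policy = OpponentConditionedPolicy(state_dim=8, opp_action_dim=4, action_dim=4)
state = torch.randn(1, 8)
opp_probs = torch.softmax(torch.randn(1, 4), dim=-1)
action = policy(state, opp_probs).sample()
```

Since $$\rho$$ is fixed by assumption, its output would be treated purely as input features, so no gradients flow back into the opponent model. I’m unsure whether plain concatenation is the right inductive bias here, or whether something more structured (e.g., conditioning via a learned embedding of $$\rho(\cdot|s_t)$$) is considered better practice.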