I am considering using a transformer on input data that can be represented as embeddings, so that I can use the attention mechanism of the transformer architecture. My input is sequential, and both the input and the output have variable length.

My output is supposed to be either a numerical value or a probability for each output variable. Originally there were supposed to be 13 numerical outputs, but I decided to use probability scores as a way of normalizing the output.

Can I use two output vectors with 7 features each (14 slots in total) instead of 13 numeric outputs? Each feature would map to one of the original 13 outputs, and the one leftover feature (the last one) would always be 0. I am asking because, as I understand it, PyTorch expects the output to have the same number of features as the input, and my input variables are embedded as 7 features.

Should this approach work? I am unsure how the loss function would handle this. Is there a loss function that would allow for it?
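To make the proposed layout concrete, here is a minimal sketch of packing 13 targets into two 7-feature vectors and keeping the always-0 padding slot out of the loss. Several things in it are assumptions for illustration and not part of the setup described above: the helper names `pack_targets`, `slot_mask` and `masked_bce` are hypothetical, the decoder input is a pair of learned query vectors, the loss is binary cross-entropy on logits (chosen because the targets are probability scores), and the `nn.Transformer` hyperparameters, batch size and input length are arbitrary.

```python
import torch
import torch.nn as nn

D_MODEL = 7      # embedding size of each input step (from the question)
N_TARGETS = 13   # number of original output variables (from the question)
TGT_LEN = 2      # two output vectors of 7 features: 2 * 7 = 14 slots


def pack_targets(y):
    """Pad (batch, 13) targets with one zero and reshape to (batch, 2, 7)."""
    pad = y.new_zeros(y.shape[0], TGT_LEN * D_MODEL - N_TARGETS)
    return torch.cat([y, pad], dim=1).reshape(-1, TGT_LEN, D_MODEL)


def slot_mask():
    """1.0 for the 13 real slots, 0.0 for the padding slot; shape (2, 7)."""
    mask = torch.ones(TGT_LEN * D_MODEL)
    mask[N_TARGETS:] = 0.0
    return mask.reshape(TGT_LEN, D_MODEL)


def masked_bce(logits, packed, mask):
    """Binary cross-entropy averaged over the real (unmasked) slots only."""
    per_slot = nn.functional.binary_cross_entropy_with_logits(
        logits, packed, reduction="none"
    )
    return (per_slot * mask).sum() / (mask.sum() * logits.shape[0])


torch.manual_seed(0)

# Assumed hyperparameters; nhead must divide d_model, and 7 is prime.
model = nn.Transformer(
    d_model=D_MODEL, nhead=1,
    num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=16, dropout=0.0, batch_first=True,
)
# Assumption: one learned query vector per output position.
queries = nn.Parameter(torch.randn(TGT_LEN, D_MODEL))

src = torch.randn(4, 5, D_MODEL)   # (batch, input length, 7); length may vary
y = torch.rand(4, N_TARGETS)       # probability-style targets in [0, 1]
packed = pack_targets(y)           # (4, 2, 7), last slot is padding
mask = slot_mask()

dec_in = queries.unsqueeze(0).expand(src.shape[0], -1, -1)
logits = model(src, dec_in)        # (4, 2, 7): same feature size as the input
loss = masked_bce(logits, packed, mask)
loss.backward()
```

Because the mask zeroes the padding slot, the 14th feature contributes nothing to the loss or the gradients. For plain numeric targets, the same masking would work with an element-wise squared error in place of the cross-entropy.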