This article says the following:
Deciding between the sigmoid or tanh will depend on your requirement of gradient strength.
I have seen (so far in my learning) 7 activation functions/curves. Each one seems to build on the last. But, as in the quote above, I have read in many places essentially that "based on your requirements, select your activation function and tune it to your specific use case".
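For context, my understanding is that "gradient strength" in the quote refers to how large the derivative of the activation is, since that derivative scales the error signal during backpropagation. Here is a minimal sketch of that comparison using only the standard library. The helper names (`sigmoid_grad`, `tanh_grad`) and the sample inputs are my own choices for illustration, not something taken from the article.

```python
import math

def sigmoid(x):
    # Logistic sigmoid: squashes x into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2.
    t = math.tanh(x)
    return 1.0 - t * t

# Compare the slopes at a few illustrative inputs.
for x in (0.0, 2.0, 5.0):
    print(f"x={x:4.1f}  sigmoid'={sigmoid_grad(x):.4f}  tanh'={tanh_grad(x):.4f}")
```

At x = 0 the tanh slope is 1.0 while the sigmoid slope is 0.25, so tanh passes a stronger gradient near the origin. For larger inputs, though, the tanh slope falls below the sigmoid slope, which already hints that neither curve wins everywhere.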
This doesn’t seem scalable. From an engineering perspective, a human has to come in and tinker around with each neural network to find the right or optimal activation function, which seems like it would take a lot of time and effort. I’ve seen papers which seem to describe people working on automatically finding the "best" activation function for a particular data set too. From an abstraction standpoint, it’s like writing code to handle each user individually on a website, independently of the others, rather than just writing one user authentication system that works for everyone (as an analogy).
What all these papers/articles are missing is an explanation of why. Why can’t you just have one activation function that works optimally in all cases? Then engineers wouldn’t have to tinker with each new dataset and neural network; they could just create one generalized neural network that works well for all the common tasks today’s and tomorrow’s neural networks are applied to. If someone finds a more optimal one, then that would be beneficial, but until the next optimal one is found, why can’t you just use one activation function for all situations? I am missing this key piece of information from my current readings.
What are some examples of why it’s not possible to have a keystone activation function?