I want to do sound recognition with deep learning

The aim of this software is to analyse video files and mark a timestamp wherever a sample sound is found. The sample sound could be anything: a word, a scream, the particular way a dog door sounds when it closes, or any other audible, repeatable noise.

The software would take in one or more audio files of the sample sound and train on that data with deep learning (I'm not sure how that part works). It would then identify similar sounds in the video files that are passed in.
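To make the goal concrete, here is a rough sketch of the scanning step as I picture it. It is not deep learning: it compares a plain magnitude spectrum against each window using cosine similarity, where a trained network would presumably supply the features instead. The function names, the 0.9 threshold, the hop size, and the synthetic test tone are all placeholders I made up, and it assumes the audio track has already been extracted from the video into a NumPy array.

```python
import numpy as np

def spectrum(frame):
    """Magnitude spectrum of one audio window (stand-in for learned features)."""
    windowed = frame * np.hanning(len(frame))
    return np.abs(np.fft.rfft(windowed))

def find_matches(track, sample, rate, hop=None, threshold=0.9):
    """Slide a sample-sized window over the track and return
    (timestamp_in_seconds, score) pairs where the cosine similarity
    between the window and the sample reaches the threshold."""
    n = len(sample)
    hop = hop or n // 4  # assumed default: 75% overlap between windows
    ref = spectrum(sample)
    ref = ref / np.linalg.norm(ref)
    hits = []
    for start in range(0, len(track) - n + 1, hop):
        feat = spectrum(track[start:start + n])
        norm = np.linalg.norm(feat)
        if norm == 0:
            continue  # skip pure digital silence
        score = float(ref @ feat / norm)
        if score >= threshold:
            hits.append((start / rate, score))
    return hits

# Synthetic check: a 0.25 s, 440 Hz tone hidden 1.0 s into 3 s of faint noise.
rate = 8000
t = np.arange(rate // 4) / rate
sample = np.sin(2 * np.pi * 440 * t)
rng = np.random.default_rng(0)
track = 0.01 * rng.standard_normal(3 * rate)
track[rate:rate + len(sample)] += sample

hits = find_matches(track, sample, rate)
best_time, best_score = max(hits, key=lambda h: h[1])
print(f"best match at {best_time:.2f} s (score {best_score:.3f})")
```

Overlapping windows around the true position also pass the threshold, so a real version would need to merge neighbouring hits into a single timestamp.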

My question is: how should I get started? I have experience in C, C++, C#, and Python (willing to learn). When I think of deep learning, I think of TensorFlow, but I have a feeling it is probably not the best system to use for this. I’d like some suggestions, thanks.