How could one compare the shape of abstract syntax trees of similar source code programs (C, C++, Go, or anything compiled with GCC…)?
I guess that plagiarism detection on source code would use such techniques, but I have no idea what they would be called…
For example, unification could be used to compare ASTs, but it gives only a boolean answer. I’m looking for a technique that gives some numerical “distance”, or some kind of numerical vector (to be fed later to e.g. machine learning or classification algorithms, or some other big data thing).
Any references to big data or machine learning approaches on large sets of source code are welcome too.
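To make the "numerical vector" idea concrete, here is a minimal sketch that counts how often each kind of AST node occurs and compares two such count vectors by cosine similarity. It is only an illustration under assumptions: Python's standard `ast` module stands in for GCC's AST or Gimple, and the names `shape_vector`, `cosine_similarity` and `shape_distance` are made up for this example.

```python
import ast
import math
from collections import Counter


def shape_vector(source: str) -> Counter:
    """Count how often each AST node type occurs in the given source.

    Identifier names and constant values are ignored; only node kinds count.
    (Hypothetical helper: Python's ast stands in for a GCC representation.)
    """
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shape_distance(src_a: str, src_b: str) -> float:
    """A crude distance in [0, 1]: 0 means identical node-kind proportions."""
    return 1.0 - cosine_similarity(shape_vector(src_a), shape_vector(src_b))
```

Two routines that differ only in identifier names get a distance of (almost exactly) zero with this measure. It ignores how the nodes are nested, so it captures only a rough part of the syntactic shape.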
(Sorry for such a broad or fuzzy question, I don’t know what terminology to use)
I don’t simply want to compare two ASTs or programs. I want to process a large set of programs (e.g. half of the source code of a Debian distribution) and find similar routines inside it. I already have MELT to work on GCC internal representations (Gimple) and I want to build on that, hence store several metrics (which ones? cyclomatic complexity is probably not enough) in e.g. some database, then compare & process them…
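As a rough sketch of the "store metrics in a database and compare them" workflow I have in mind: the snippet below keeps one row of metrics per routine in SQLite and lists the closest pairs by Euclidean distance. Everything specific here is an assumption for illustration (the table name `routine_metrics`, the three metrics `n_blocks`, `n_calls`, `n_loops`, and the helper names); in my setting the numbers would be produced by a MELT pass over Gimple.

```python
import math
import sqlite3
from itertools import combinations


def make_db(rows):
    """Create an in-memory database with one row of metrics per routine.

    The schema and the choice of metrics are hypothetical.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE routine_metrics ("
        "name TEXT PRIMARY KEY, n_blocks INTEGER, n_calls INTEGER, n_loops INTEGER)"
    )
    db.executemany("INSERT INTO routine_metrics VALUES (?, ?, ?, ?)", rows)
    return db


def closest_pairs(db, limit=3):
    """Return up to `limit` (distance, name_a, name_b) tuples, smallest first."""
    rows = db.execute(
        "SELECT name, n_blocks, n_calls, n_loops FROM routine_metrics ORDER BY name"
    ).fetchall()
    pairs = []
    # Compare every pair of routines: quadratic, fine only for small sets.
    for (name_a, *vec_a), (name_b, *vec_b) in combinations(rows, 2):
        pairs.append((math.dist(vec_a, vec_b), name_a, name_b))
    pairs.sort()
    return pairs[:limit]
```

The all-pairs loop is quadratic in the number of routines, so for something the size of a distribution it would have to be replaced by some indexing or clustering scheme; that is part of what I am asking about.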
Addenda: I found out about the MOSS system & paper, but it does not seem to care about syntactic shape at all. I am also looking into tree edit distance.
I also found (thanks to Jérémie Salvucci) Michel Chilowicz’s PhD thesis (in French, November 2010) on Looking for Similarity in Source Code.
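On tree edit distance, mentioned in the addenda: the sketch below is a simplified top-down variant (in the spirit of Selkow's formulation, not the general Zhang-Shasha algorithm), in which a node can be relabelled and whole subtrees can be inserted or deleted at a cost equal to their size. Trees are plain `(label, children)` tuples; the helper names and the use of Python's `ast` module as a stand-in for a GCC representation are assumptions of this example.

```python
import ast
from functools import lru_cache


def to_tuple(node):
    """Convert an ast node into a hashable (label, children) tuple."""
    children = tuple(to_tuple(c) for c in ast.iter_child_nodes(node))
    return (type(node).__name__, children)


@lru_cache(maxsize=None)
def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(c) for c in tree[1])


@lru_cache(maxsize=None)
def top_down_distance(a, b):
    """Top-down edit distance between two ordered labelled trees.

    Cost model (an assumption of this sketch): relabelling a node costs 1,
    inserting or deleting a whole subtree costs its number of nodes.
    """
    label_cost = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    # Sequence edit distance over the two child lists, row by row.
    prev = [0]
    for y in ys:
        prev.append(prev[-1] + size(y))  # insert every subtree of ys
    for x in xs:
        cur = [prev[0] + size(x)]  # delete subtree x
        for j, y in enumerate(ys, start=1):
            cur.append(min(
                prev[j] + size(x),                      # delete subtree x
                cur[j - 1] + size(y),                   # insert subtree y
                prev[j - 1] + top_down_distance(x, y),  # map x onto y
            ))
        prev = cur
    return label_cost + prev[-1]
```

For example, `top_down_distance(to_tuple(ast.parse(src_a)), to_tuple(ast.parse(src_b)))` gives an integer that is 0 when the two sources have the same tree shape and node kinds, regardless of identifier names.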