Quantitatively comparing AST shapes

How could one compare the shape of abstract syntax trees of similar source code programs (C, C++, Go, or anything compiled with GCC…)?

I guess that plagiarism detection on source code uses such techniques, but I have no idea what they would be called…

For example, unification could be used to compare ASTs, but it gives only a boolean answer. I'm looking for a technique that gives some numerical “distance”, or some kind of numerical vector (to be later fed, e.g., into machine learning or classification algorithms, or some other big-data processing).

Any references to big data or machine learning approaches on large sets of source code are welcome too.

(Sorry for such a broad and fuzzy question; I don't know what terminology to use.)

I don’t simply want to compare two ASTs or programs. I want to process a large set of programs (e.g. the source code of half of a Debian distribution) and find similar routines inside it. I already have MELT to work on GCC internal representations (Gimple), and I want to build on top of that, hence store several metrics (which ones? cyclomatic complexity alone is probably not enough) in e.g. some database and compare & process them…
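As an illustration of the kind of numerical vector I have in mind, here is a minimal sketch (using Python's `ast` module purely as a stand-in for Gimple, with a deliberately crude feature choice) that turns a routine's AST into a node-type histogram; a cosine similarity between two histograms then gives a rough numerical measure of shape similarity:

```python
import ast
from collections import Counter

def node_type_histogram(source):
    """Count how often each AST node type occurs in the given source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def cosine_similarity(h1, h2):
    """Cosine similarity between two node-type histograms (1 = same node-type mix)."""
    keys = set(h1) | set(h2)
    dot = sum(h1[k] * h2[k] for k in keys)
    norm1 = sum(v * v for v in h1.values()) ** 0.5
    norm2 = sum(v * v for v in h2.values()) ** 0.5
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

a = node_type_histogram("def f(xs):\n    return [x * x for x in xs]")
b = node_type_histogram("def g(ys):\n    return [y + 1 for y in ys]")
print(cosine_similarity(a, b))
```

Such histograms could be stored per routine in a database and fed to clustering or nearest-neighbour searches, though they obviously ignore most of the tree structure.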

Addendum: I found out about the MOSS system & paper, but it does not seem to care about syntactic shape at all. I am also looking into tree edit distance.

I also found (thanks to Jérémie Salvucci) Michel Chilowicz’s PhD thesis (in French, November 2010) on Looking for Similarity in Source Code.

How to justify using available code (in a different language) for comparing algorithms

I proposed an algorithm for a scheduling problem in a submitted paper. In the revision, the reviewer asked us to compare it with another algorithm from the literature. Our algorithm is in MATLAB, while the algorithm we compare against is in C++ and its code is publicly available. We did not re-implement the C++ code, both to avoid any decrease in the efficiency of their algorithm and to save time. Now the reviewer responds: “It is probable that there is a significant difference in performance between MATLAB and C++. The authors should make it clear if and how the results were normalized to ensure a fair comparison.”

So my question is this: Is there any (scientific) ratio or similar comparison between the efficiency of MATLAB and C++?

When we opted to use the available code, we thought it was completely fine, since MATLAB is known to be slower; running the competing algorithm in a faster environment can only disadvantage ours. I should add that our algorithm still performs much better than the competing one.

Can the sorting of a list be verified without comparing neighbors?

An $n$-item list can be verified as sorted by comparing every item to its neighbor. In my application, I will not be able to compare every item with its neighbor: instead, the comparisons will sometimes be between distant elements. Given that the list contains more than three items, and that comparison is the only supported operation, does there ever exist a “network” of comparisons that proves the list is sorted, yet is missing at least one direct neighbor-to-neighbor comparison?

Formally, for a sequence of elements $e_i$, I have a set of pairs of indices $(j,k)$ for which I know whether $e_j > e_k$, $e_j = e_k$, or $e_j < e_k$. There exists a pair $(l,l+1)$ that is missing from the set of comparisons. Is it ever possible, then, to prove that the sequence is sorted?
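To make the notion of a “network” of comparisons concrete, here is a minimal sketch (my own, with 0-based indices and a purely hypothetical instance) that closes a set of comparison outcomes under transitivity and checks whether every adjacent pair is forced into non-decreasing order:

```python
from itertools import product

def provably_sorted(n, comparisons):
    """comparisons maps a pair of indices (j, k) to '<', '=' or '>'.
    Returns True if the outcomes force e_0 <= e_1 <= ... <= e_{n-1}."""
    # le[j][k] is True when e_j <= e_k is known, directly or by transitivity.
    le = [[j == k for k in range(n)] for j in range(n)]
    for (j, k), rel in comparisons.items():
        if rel in ('<', '='):
            le[j][k] = True
        if rel in ('>', '='):
            le[k][j] = True
    # Transitive closure (Warshall, with the intermediate index m outermost).
    for m, j, k in product(range(n), repeat=3):
        if le[j][m] and le[m][k]:
            le[j][k] = True
    return all(le[i][i + 1] for i in range(n - 1))

# Hypothetical instance: the neighbor pair (1, 2) is never compared directly,
# yet the recorded outcomes still pin down the order e_0 = e_1 = e_2 < e_3.
print(provably_sorted(4, {(0, 1): '=', (0, 2): '=', (2, 3): '<'}))  # True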

Improve Performance of Comparing two Numpy Arrays

I had a code challenge for a class I’m taking that built an NN algorithm. I got it to work, but I used really basic methods to solve it. There are two 1D NumPy arrays of equal length, both holding values 0–2; they represent the test labels and the fitted predictions. The output is a confusion matrix that shows which items received the right predictions and which received the wrong ones (doesn’t matter ;).

This code is correct – I just feel I took the lazy way out by working with lists and then turning those lists into an ndarray. I would love to see if people have some tips on utilizing NumPy for this. Anything clever?

```python
import numpy as np

x = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0]
y = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

testy = np.array(x)
testy_fit = np.array(y)

row_no = [0, 0, 0]
row_dh = [0, 0, 0]
row_sl = [0, 0, 0]

# Code for the first row - NO
for i in range(len(testy)):
    if testy.item(i) == 0 and testy_fit.item(i) == 0:
        row_no[0] += 1
    elif testy.item(i) == 0 and testy_fit.item(i) == 1:
        row_no[1] += 1
    elif testy.item(i) == 0 and testy_fit.item(i) == 2:
        row_no[2] += 1

# Code for the second row - DH
for i in range(len(testy)):
    if testy.item(i) == 1 and testy_fit.item(i) == 0:
        row_dh[0] += 1
    elif testy.item(i) == 1 and testy_fit.item(i) == 1:
        row_dh[1] += 1
    elif testy.item(i) == 1 and testy_fit.item(i) == 2:
        row_dh[2] += 1

# Code for the third row - SL
for i in range(len(testy)):
    if testy.item(i) == 2 and testy_fit.item(i) == 0:
        row_sl[0] += 1
    elif testy.item(i) == 2 and testy_fit.item(i) == 1:
        row_sl[1] += 1
    elif testy.item(i) == 2 and testy_fit.item(i) == 2:
        row_sl[2] += 1

confusion = np.array([row_no, row_dh, row_sl])

print(confusion)
```

The result of the print is correct, as follows:

```
[[16 10  0]
 [ 2 10  0]
 [ 2  0 22]]
```
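A minimal vectorized sketch of the same computation, assuming (as above) that both arrays hold integer labels in {0, 1, 2}: `np.add.at` accumulates a count at each (true label, predicted label) index pair, so the three hand-written loops collapse into one call.

```python
import numpy as np

# Vectorized confusion matrix; assumes testy (true labels) and testy_fit
# (predicted labels) are the integer arrays built above, with values in {0, 1, 2}.
n_classes = 3
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (testy, testy_fit), 1)   # count each (true, predicted) pair
print(confusion)
```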

Comparing two Riemannian metrics on Grassmannian

Let $G_r(n)$ be the real Grassmannian, i.e. the collection of all $r$-dimensional subspaces of $\mathbb{R}^n$, equipped with the usual invariant metric $g$.

Let $U_A\in\mathbb{R}^{n\times r}$ and $U_B\in\mathbb{R}^{n\times r}$ be orthonormal bases of $A$ and $B$, and let $1\geq\sigma_1\geq\dots\geq\sigma_r\geq 0$ be the singular values of $U_A^T U_B$. It is well known that the geodesic distance under the metric $g$ between two elements $A,B\in G_r(n)$ is
$$d_g(A,B)=\sqrt{\sum_{i=1}^r\arccos^2\sigma_i}.$$
The values $\arccos\sigma_i$, $i=1,\dots,r$, are also called the principal angles between $A$ and $B$.

Now, for a given sequence $w_1,\dots,w_r>0$, define another distance on $G_r(n)$ such that for any subspaces $A,B$:
$$\tilde{d}_W(A,B)=\sqrt{\sum_{i=1}^r w_i\arccos^2\sigma_i}.$$
My questions are the following:

  1. Does there exist another Riemannian metric $\tilde{g}$ on $G_r(n)$ such that its geodesic distance is exactly $\tilde{d}_W$?

  2. If $\tilde{g}$ exists, let $\tilde{\mu}$ be the volume measure induced by $\tilde{g}$ and $\mu$ the volume measure induced by $g$. For a given $A\in G_r(n)$ and $a>0$, what is the relationship between $\mu(\{B\in G_r(n): \tilde{d}_W(A,B)\leq a\})$ and $\tilde{\mu}(\{B\in G_r(n): d_{\tilde{g}}(A,B)=\tilde{d}_W(A,B)\leq a\})$?

  3. It is known that $G_r(n)$ under the metric $g$ has positive Ricci curvature. Does $G_r(n)$ under the metric $\tilde{g}$ still have positive Ricci curvature?
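For concreteness, here is a small numerical sketch of the two distances being compared (my own illustration; the dimensions and weights are arbitrary choices). The principal angles are recovered from the SVD of $U_A^T U_B$, and $d_g$ and $\tilde{d}_W$ are then evaluated from them:

```python
import numpy as np

def random_subspace_basis(n, r, rng):
    """Orthonormal basis (n x r) of a random r-dimensional subspace of R^n."""
    q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return q

rng = np.random.default_rng(0)
n, r = 6, 3
U_A = random_subspace_basis(n, r, rng)
U_B = random_subspace_basis(n, r, rng)

# Principal angles: arccos of the singular values of U_A^T U_B.
sigma = np.linalg.svd(U_A.T @ U_B, compute_uv=False)
theta = np.arccos(np.clip(sigma, -1.0, 1.0))

w = np.array([2.0, 1.0, 0.5])            # arbitrary positive weights w_1, ..., w_r
d_g = np.sqrt(np.sum(theta**2))          # usual geodesic distance d_g(A, B)
d_W = np.sqrt(np.sum(w * theta**2))      # weighted distance d_W(A, B) from the question

print(d_g, d_W)
```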

Comparing columns in different pandas data frames and filling in a new column

I have two dataframes. The first contains companies and their corresponding texts; the texts are in lists:

```
supplier_company_name   Main_Text
JDA SOFTWARE            ['Supply chains', 'The answer is simple -RunJDA!']
PTC                     ['Hello', 'Solution']
```

The second dataframe contains texts extracted from the companies’ websites:

```
   Company         Text
0  JDA SOFTWARE    About | JDA Software
1  JDA SOFTWARE    833.JDA.4ROI
2  JDA SOFTWARE    Contact Us
3  JDA SOFTWARE    Customer Support
4  PTC             Training
5  PTC             Partner Advantage
```

I want to create a new column in the second dataframe: if the text extracted from the web matches a text in the Main_Text column of the first dataframe, fill in True, else fill in False.

Code:

```python
target = []
for x in tqdm(range(len(df['supplier_company_name']))):  # company name in df1
    # print(x)
    for y in range(len(samp['Company'])):  # company name in df2
        if samp['Company'][y] == df['supplier_company_name'][x]:  # if the company name matches
            # check if the text matches
            if samp.iloc[:, 1][y] in df['Cleaned_company_description'][x]:
                target.append(True)
            else:
                target.append(False)
```

How can I change my code to run efficiently?
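One way to drop the nested loop is to build a lookup from company name to its set of texts and then make a single pass over the second frame. A minimal sketch, assuming `df` has the columns `supplier_company_name` and `Cleaned_company_description` (lists of strings) and `samp` has `Company` and `Text` as shown above:

```python
import pandas as pd

# Hypothetical frames mirroring the structure described above.
df = pd.DataFrame({
    'supplier_company_name': ['JDA SOFTWARE', 'PTC'],
    'Cleaned_company_description': [['Supply chains', 'Contact Us'], ['Hello', 'Solution']],
})
samp = pd.DataFrame({
    'Company': ['JDA SOFTWARE', 'JDA SOFTWARE', 'PTC'],
    'Text': ['Contact Us', 'Customer Support', 'Training'],
})

# Lookup: company name -> set of its description texts.
lookup = {name: set(texts)
          for name, texts in zip(df['supplier_company_name'],
                                 df['Cleaned_company_description'])}

# One pass over the second frame instead of a nested loop.
samp['target'] = [text in lookup.get(company, set())
                  for company, text in zip(samp['Company'], samp['Text'])]
print(samp)
```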

Comparing ample and nef line bundles

I started reading Positivity in Algebraic Geometry I by Robert Lazarsfeld, where he introduces nef (numerically effective) line bundles after observing that for an ample line bundle $\mathscr{L}$, one has \begin{align} (\mathscr{L}^{\otimes k} \cdot V) > 0 && \text{for all subvarieties }V \subset X \text{ of dimension } k. \end{align}

Following this, a line bundle $\mathscr{L}$ is defined to be nef if \begin{align} (\mathscr{L} \cdot C) \geq 0 && \text{for all curves } C \subset X.\end{align} In particular, we can reduce the question of nefness to computing the degree of $\mathscr{L}|_C$. Then Lazarsfeld proves a theorem of Kleiman: $\mathscr{L}$ is nef if and only if \begin{align}(\mathscr{L}^{\otimes k} \cdot V) \geq 0 && \text{for all subvarieties } V \subset X \text{ of dimension } k.\end{align} So nef is clearly a generalization of ample.

I wonder if there is an analogous statement for ampleness, i.e. whether $\mathscr{L}$ is ample if and only if $\deg{\mathscr{L}|_C} > 0$ for all curves $C \subset X$.

Comparing each item from one dir with each item from another dir

The task is to compare students’ homework SQL files with mentors’ SQL files.

I’ve written two functions, each of which returns a two-dimensional array (the first element of each entry is an absolute path, the second a relative one).

Then I’m going to compare the relative paths of students and mentors, and execute the SQL files (found via their absolute paths) when these values are equal.

Is there a more elegant realization?

The folder structure of the mentors dir:

```
Homework (folder)
├ 1 (folder)
| ├ 1.sql
| ├ 2.sql
| └ n.sql
├ 2 (folder)
| ├ 1.sql
| ├ 2.sql
| └ n.sql
├ n (folder)
| ├ 1.sql
| ├ 2.sql
| └ n.sql
```

The folder structure of the students dir:

```
Students Homework (folder)
├ Student1 (folder)
| ├ 1 (folder)
| | ├ 1.sql
| | ├ 2.sql
| | └ n.sql
| ├ 2 (folder)
| | ├ 1.sql
| | ├ 2.sql
| | └ n.sql
| └ n (folder)
|   ├ 1.sql
|   ├ 2.sql
|   └ n.sql
├ Student2 (folder)
| ├ 1 (folder)
| | ├ 1.sql
| | ├ 2.sql
| | └ n.sql
| ├ 2 (folder)
| | ├ 1.sql
| | ├ 2.sql
| | └ n.sql
| └ n (folder)
|   ├ 1.sql
|   ├ 2.sql
|   └ n.sql
```

“Mentors” function:

```python
from os import path, walk

def find_mentors_sql(config):

    mentors_sql_abs = []
    mentors_sql_rel = []

    for dirpath, subdirs, files in walk(config["MAIN_DIR"] + '\Homework'):
        mentors_sql_abs.extend(path.join(dirpath, x) for x in files if x.endswith(".sql"))
        mentors_sql_rel.extend(path.join(path.basename(dirpath), x) for x in files if x.endswith(".sql"))

    mentors_sql = [[0] * 2 for i in range(len(mentors_sql_abs))]

    iter = 0
    for _ in mentors_sql_abs:
        mentors_sql[iter][0] = mentors_sql_abs[iter]
        iter += 1

    iter1 = 0
    for _ in mentors_sql_rel:
        mentors_sql[iter1][1] = mentors_sql_rel[iter1]
        iter1 += 1

    return mentors_sql
```

“Students” function (the logic is similar to the previous one):

```python
def find_students_sql(config):

    students_sql_abs = []
    students_sql_rel = []

    for dirpath, subdirs, files in walk(config["MAIN_DIR"] + '\Students Homework'):
        students_sql_abs.extend(path.join(dirpath, x) for x in files if x.endswith(".sql"))
        students_sql_rel.extend(path.join(path.basename(dirpath), x) for x in files if x.endswith(".sql"))

    students_sql = [[0] * 2 for i in range(len(students_sql_abs))]

    iter = 0
    for _ in students_sql:
        students_sql[iter][0] = students_sql_abs[iter]
        iter += 1

    iter1 = 0
    for _ in students_sql:
        students_sql[iter1][1] = students_sql_rel[iter1]
        iter1 += 1

    return students_sql
```
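For what it's worth, a more compact sketch of the same idea using pathlib (my own variant, assuming the same MAIN_DIR layout as above); it pairs each absolute path with a relative path directly, so the two near-identical functions collapse into one:

```python
from pathlib import Path

def find_sql(root):
    """Return (absolute_path, path_relative_to_root) pairs for every .sql file under root."""
    root = Path(root)
    return [(str(p.resolve()), str(p.relative_to(root))) for p in root.rglob("*.sql")]

# Hypothetical usage mirroring the two functions above:
# mentors_sql  = find_sql(Path(config["MAIN_DIR"]) / "Homework")
# students_sql = find_sql(Path(config["MAIN_DIR"]) / "Students Homework")
```

Note that this keeps the path relative to the given root rather than just the parent folder name plus file name, which may or may not be what is needed when matching students against mentors.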