Making groups out of a list of things based on some similarity

Let’s say I have a list of n items – i1, i2, …, in.

These items are similar to one another based on some similarity function, say similar_func(item_x, item_y), where item_x and item_y belong to the list i1, i2, …, in. This function returns a number ranging from 0 to 1, where 0 means not similar at all and 1 means exactly similar.

Now I would like to create a number of groups, each of which may contain items from the above list. Each group should contain only items that are similar to every other item in that group by at least, let’s say, 0.5.

I could run a double loop over the list, comparing each item with every other item and creating a group whenever I find two items with similarity greater than 0.5. Of course, I would need to check whether an item already exists in a previously created group, and compare it with all the other items in that group.

This brute force method is possible and will give the right answer.
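For concreteness, the brute force I have in mind looks something like this in Python (similar_func is the similarity function described above; the exact groups produced depend on iteration order):

    def brute_force_groups(items, similar_func, threshold=0.5):
        groups = []  # each group is a list whose members are pairwise similar >= threshold
        for i, x in enumerate(items):
            for y in items[i + 1:]:
                if similar_func(x, y) < threshold:
                    continue
                placed = False
                for group in groups:
                    # x and y may join a group only if they are similar to all its members
                    if all(similar_func(m, x) >= threshold and
                           similar_func(m, y) >= threshold for m in group):
                        for z in (x, y):
                            if z not in group:
                                group.append(z)
                        placed = True
                if not placed:
                    groups.append([x, y])
        return groups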

But I would like to do this more efficiently. Is there an efficient method for solving this problem?

Edit: There can be any number of groups and an item can be in multiple groups.

If similarity(x,y) = 0.6 and similarity(x,z) = 0.6, but similarity(y,z) = 0.4, then there will be two groups -> [x,y] and [x,z]

Also, if x is similar to y AND x is similar to z, that doesn’t mean that y is similar to z. The similarity between two items depends solely on the return value of the similarity function when those items are passed as arguments to it.

Finding the similarity between large text files

My first question is: does an algorithm for this already exist? If not, any thoughts and ideas are appreciated.

Let’s say I have two large text files (original file A and new file B). Each file is English prose text (including dialogue) with a typical size of 256K to 500K characters.

I want to compare them to find out how similar the contents of B are to A.

Similar in this case means: all, or a significant part, of B exists in A, with the condition that there may be subtle differences, words changed here and there, or even globally.

In all cases we have to remember that this is looking for similarity, not (necessarily) identity.

Preprocessing for the text:

  1. Remove all punctuation (and close up gaps “didn’t” -> “didnt”);
  2. Lowercase everything;
  3. Remove common words;
  4. Reduce all whitespace to single space only, but keep paragraphs;
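A minimal sketch of those four steps in Python (the stop-word list is just a placeholder, not the real list of common words):

    import re

    STOP_WORDS = {"the", "a", "an", "and", "of", "to"}  # placeholder list

    def preprocess(raw_text):
        # Keep paragraphs by splitting on blank lines before collapsing other whitespace.
        paragraphs = re.split(r"\n\s*\n", raw_text)
        cleaned = []
        for para in paragraphs:
            para = re.sub(r"[^\w\s]", "", para)    # 1. remove punctuation ("didn't" -> "didnt")
            para = para.lower()                    # 2. lowercase everything
            words = [w for w in para.split() if w not in STOP_WORDS]  # 3. remove common words
            cleaned.append(" ".join(words))        # 4. single spaces only
        return cleaned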

Other possible optimisations to reduce future workload:

Ignore any paragraph of less than a certain length. Why? Because there’s a higher probability of natural duplication in shorter paragraphs (though arguably not in the same overall position).

Have an arbitrary cut-off length on the paragraphs. Why? Mostly because it reduces workload.

Finally:

For every word, turn it into a Metaphone. So instead of every paragraph being composed of normal words, it becomes a list of Metaphones, which helps in comparing slightly modified words.
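Assuming a library such as jellyfish (which provides a metaphone() function), that last step could look like:

    import jellyfish

    def encode_paragraph(paragraph):
        # Each preprocessed word is replaced by its Metaphone code, so slightly
        # modified spellings of the same word still compare as equal.
        return " ".join(jellyfish.metaphone(word) for word in paragraph.split())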

We end up with paragraphs that look like this (each of these lines is a separate paragraph):

WNT TR0 ABT E0L JRTN TTKTF INSPK WLMS E0L UTRL OBNKS JRL TM RL SRPRS LKT TRKTL KM N SX WRT LT ASK W0R RT WRKS T ST N WLTNT RT 0M AL 

But I admit, when it comes to the comparison I’m not sure how to approach it, beyond brute force: take the first encoded paragraph from B (B[0]) and check every paragraph in A, looking for a high match (maybe identical, maybe very similar). Perhaps we use Levenshtein to get a match percentage on the paragraphs.

If we find a match at A[n], then check B[1] against A[n+1], and maybe a couple further (A[n+2] and A[n+3]), just in case something was inserted.

And proceed that way.
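Roughly, the loop I’m imagining (using the standard library’s SequenceMatcher as a stand-in for a Levenshtein percentage; the threshold and look-ahead window are guesses):

    from difflib import SequenceMatcher

    def match_ratio(p1, p2):
        # Stand-in for a Levenshtein-style match percentage on encoded paragraphs.
        return SequenceMatcher(None, p1, p2).ratio()

    def align(b_paras, a_paras, threshold=0.8, lookahead=3):
        matches = []   # (index in B, index in A, ratio)
        a_pos = 0      # where to resume in A after the previous match
        for b_idx, b_para in enumerate(b_paras):
            # Try the paragraphs just after the last match first (to allow for
            # insertions), then fall back to scanning all of A.
            candidates = list(range(a_pos, min(a_pos + lookahead, len(a_paras))))
            candidates += [i for i in range(len(a_paras)) if i not in candidates]
            for a_idx in candidates:
                ratio = match_ratio(b_para, a_paras[a_idx])
                if ratio >= threshold:
                    matches.append((b_idx, a_idx, ratio))
                    a_pos = a_idx + 1
                    break
        return matches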

What should be detected:

  • Near-identical text
  • Global proper noun changes
  • B is a subset of A

Thanks.

Computing similarity of two graphs with partially overlapping sets of nodes

Consider two graphs $ G_1 = (E_1, V_1) $ and $ G_2 = (E_2, V_2)$ with their associated sets of edges $ E$ and nodes $ V$ . I’m familiar with concepts such as edit distance for computing the similarity/distance between these two graphs. I was wondering whether there are any metrics for estimating the similarity between graphs that contain only partially overlapping sets of nodes, for example, if $ V_1 = (A, B, C) $ and $ V_2 = (B, C, D)$ .

How can I express the similarity between a Bing and a Google search result?

I’m working on a “semantic” browser engine where all search engines should look the same. One way to do this is to hard-code parsing rules for each site; another is to use machine-learning. Of course the latter alternative is the hip and cool one. 😉

I need a method to cluster links from the search results of Bing, Google and other popular search engines. Menu links, pagination, etc., should end up in a different cluster from the listed search results. I’ve tried KMeans, but it didn’t work so well. DBSCAN worked fine with Google, but I couldn’t easily transfer the result to Bing.

There are a lot of different features one can consider for HTML elements, in my case <a> anchors. I’ve tried combining the number of children (recursively), text length, number of attributes and some more. Could you advise in which direction to go? I have no prior experience with machine learning, so tuning an algorithm by hand like this is not trivial.
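To make the setup concrete, here is a stripped-down version of the kind of feature extraction and clustering I mean (BeautifulSoup + scikit-learn; the file name and the eps/min_samples values are just placeholders):

    import numpy as np
    from bs4 import BeautifulSoup
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    def anchor_features(html):
        soup = BeautifulSoup(html, "html.parser")
        anchors = soup.find_all("a")
        features = [[
            len(a.find_all(True)),         # number of descendant elements ("children", recursively)
            len(a.get_text(strip=True)),   # text length
            len(a.attrs),                  # number of attributes
        ] for a in anchors]
        return anchors, np.array(features, dtype=float)

    anchors, X = anchor_features(open("serp.html").read())   # placeholder file name
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(StandardScaler().fit_transform(X))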

Thankful for any feedback.

Point cloud clustering based on similarity in less than $O(n^2)$

Not sure if this is the right place to ask this, but here goes.

Let’s assume I have a dataset of 2D points consisting of facial landmarks, and I want to cluster these based on similarity so that I can refer to a notion of a “prototype” of a facial expression.

As a measure between one point set and another I could be using a sum of Euclidean distances.

Is there a way to obtain a set of prototypes without passing through the landmarks $ O(n^2)$ times?

To better explain what I’m thinking, let’s say I’m processing a video frame by frame.

I start with the landmarks of the first frame, and I put it in my list of prototypes because I have no other reference.

Following this, for each subsequent frame, I compare it with the first “prototype”; if it’s too similar (the difference is below a certain threshold) I assume it’s not unique enough and skip it. I continue like this until I find a set of landmarks that is dissimilar enough, at which point I have two “prototypes”.

From this point onward, I need to do the similarity check with two “prototypes”, and so on.

Another caveat is that I would also like to be able to store the “prototype” that the current frame matches the most.

I will also need to do a second pass through a second clip for a similar matching with the “prototypes” identified in the first pass.
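The naive approach, written out (NumPy; each frame is an (N, 2) array of landmarks and the threshold is arbitrary):

    import numpy as np

    def distance(landmarks_a, landmarks_b):
        # Sum of Euclidean distances between corresponding 2D landmarks.
        return np.linalg.norm(landmarks_a - landmarks_b, axis=1).sum()

    def cluster_frames(frames, threshold):
        prototypes = []    # landmark sets kept as "prototypes"
        best_match = []    # index of the closest prototype for each frame
        for landmarks in frames:
            if not prototypes:
                prototypes.append(landmarks)
                best_match.append(0)
                continue
            dists = [distance(landmarks, p) for p in prototypes]
            nearest = int(np.argmin(dists))
            if dists[nearest] > threshold:          # dissimilar enough: new prototype
                prototypes.append(landmarks)
                best_match.append(len(prototypes) - 1)
            else:                                   # otherwise remember the best match
                best_match.append(nearest)
        return prototypes, best_match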

Is there a more efficient way to do this, other than the naive approach?

Orthogonal similarity of adjacency matrices of graphs which are cospectral, have cospectral complements and have a common equitable partition

Let $ G$ and $ H$ be two undirected graphs of the same order (i.e., they have the same number of vertices). Denote by $ A_G$ and $ A_H$ the corresponding adjacency matrices. Furthermore, denote by $ \bar G$ and $ \bar H$ the complement graphs of $ G$ and $ H$ , respectively.

When $ G$ and $ H$ are cospectral, and $ \bar G$ and $ \bar H$ are cospectral, it is known (see e.g., Theorem 3 in Van Dam et al. [1]) that there exists an orthogonal matrix $ O$ such that $ A_G\cdot O=O\cdot A_H$ and furthermore, $ O\cdot \mathbf{1}=\mathbf{1}$ , where $ \mathbf{1}$ denotes the vector consisting of all ones.

Suppose that, in addition, $ G$ and $ H$ have a common equitable partition. That is, there exist partitions $ {\cal V}=\{V_1,\ldots,V_\ell\}$ of the vertices in $ G$ and $ {\cal W}=\{W_1,\ldots,W_\ell\}$ of the vertices in $ H$ such that (i) $ |V_i|=|W_i|$ for all $ i=1,\ldots,\ell$ ; and (ii) $ \text{deg}(v,V_j)=\text{deg}(w,W_j)$ for any $ v$ in $ V_i$ and $ w$ in $ W_i$ , and this for all $ i,j=1,\ldots,\ell$ .

Question:

  • What extra structural conditions on the orthogonal matrix $ O$ , apart from $ A_G\cdot O=O\cdot A_H$ and $ O\cdot \mathbf{1}=\mathbf{1}$ , can be derived when $ G$ and $ H$ are cospectral, have cospectral complements, and have a common equitable partition?

I am particularly interested in showing that one can assume that $ O$ is block structured according to the partitions involved. That is, if $ \mathbf{1}_{V_i}$ and $ \mathbf{1}_{W_i}$ denote the indicator vectors of the (common) partitions $ {\cal V}$ and $ {\cal W}$ , respectively, can $ O$ be assumed to satisfy $$\text{diag}(\mathbf{1}_{V_i})\cdot O=O\cdot \text{diag}(\mathbf{1}_{W_i})$$ for $ i=1,\ldots,\ell$ ? Here, $ \text{diag}(v)$ for a vector $ v$ denotes the diagonal matrix with $ v$ on its diagonal.
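Concretely, if the vertices of $ G$ are ordered by the classes $ V_1,\ldots,V_\ell$ and those of $ H$ by $ W_1,\ldots,W_\ell$ , then requiring this for all $ i$ is the same as requiring that $ O$ be block diagonal with respect to the partitions,
$$O=\begin{bmatrix} O_1 & & \\ & \ddots & \\ & & O_\ell \end{bmatrix},\qquad O_i\in\mathbb{R}^{|V_i|\times|W_i|},$$
since the equations force every block of $ O$ with rows indexed by $ V_i$ and columns indexed by $ W_j$ , $ i\neq j$ , to vanish.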

[1] Cospectral graphs and the generalized adjacency matrix, E.R. van Dam, W.H. Haemers, J.H. Koolen. Linear Algebra and its Applications 423 (2007) 33–41. https://doi.org/10.1016/j.laa.2006.07.017

Design Pattern for Data Enrichment vs Similarity Matching

I have some things (JSON) coming from various sources (Search engine, Crawled HTML, APIs…). I’m trying to match similar things from the different sources as follows:

[1] Collect -> [2] Prepare -> [3] Match Similar 

With [2] Prepare I do some data manipulation to facilitate matching:

  • Normalization: formatting/cleaning values (eg trimming/lowercasing names)
  • Enrichment: extrapolating data (eg get domains from URLs: http://www.domain1.com/home and https://domain1.com/contact are different URLs but both have the same domain domain1.com so these things are related)
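A minimal sketch of that domain enrichment, assuming Python’s standard urllib.parse (the www-stripping is deliberately simplistic):

    from urllib.parse import urlparse

    def extract_domain(url):
        # http://www.domain1.com/home and https://domain1.com/contact both map to
        # "domain1.com", so the two things can be related through this field.
        host = urlparse(url).netloc.lower()
        return host[4:] if host.startswith("www.") else host

    assert extract_domain("http://www.domain1.com/home") == "domain1.com"
    assert extract_domain("https://domain1.com/contact") == "domain1.com"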

With [3] Match I compare the values found in the things and match those with the most similarities together.

Issues with [3] Match:

  • if I skip enrichment (e.g. don’t add domains from URLs), there’s just too much noise in the data for the similarity matching to be accurate enough (e.g. I tried string distance algorithms on URLs, which doesn’t work: https://www.domain2.com/contact is closer to https://www.domain1.com/contact than to http://www.domain1.com/home, although the last two are more likely to be related since they have the same domain)
  • if I do enrichment, I get much better results; HOWEVER, the level of matching becomes influenced by how much enrichment I do versus how much of the original data is similar (e.g. if I add 3 domain fields from the URL to increase the chances of finding a match – (1) the domain, (2) the domain without any subdomains, (3) the domain without www – then I have potentially 3 matches, whereas 2 similar things which have no URLs but only a name will have only 1 match, even if it is a very unique name with a perfect match)
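To make the second issue concrete, here is a toy version of the count-based matching that produces the skew (the field names and values are made up):

    def naive_match_score(thing_a, thing_b):
        # One point per field with equal values, regardless of whether the field
        # came from the original data or from enrichment.
        return sum(1 for k in thing_a if k in thing_b and thing_a[k] == thing_b[k])

    # Related only through an enriched URL: three derived fields, three matches.
    a = {"domain": "domain1.com", "domain_no_sub": "domain1.com", "domain_no_www": "domain1.com"}
    b = dict(a)

    # A perfect match on a single, very distinctive original field: one match.
    c = {"name": "Acme Widgets International"}
    d = dict(c)

    print(naive_match_score(a, b))  # 3 -- enrichment inflates the score
    print(naive_match_score(c, d))  # 1 -- even though this is the stronger signal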

Are there common design patterns to match enriched datasets by taking into account how much of the match is related to original data vs. enrichment data?

How to increase the speed of calculating string similarity scores within a dataframe?

I have a dataframe as follows:

    import pandas as pd

    df = pd.DataFrame(data=[[1, 'Berlin'], [2, 'Paris'],
                            [3, 'Lausanne'], [4, 'Bayswater'],
                            [5, 'Table Bay'], [6, 'Bejing'],
                            [7, 'Bombay'], [8, 'About the IIS']],
                      columns=['id', 'text'])

and I want to use jaro_winkler from the jellyfish library to calculate the similarity score for each string compared with all of the rest, and output the most similar one, or get a similarity score matrix like the following:

          str1  str2  str3
    str1  1     0.6   0.7
    str2  0.6   1     0.3
    str3  0.7   0.3   1

How can I get this result in a fast way? Right now I am just using a loop to compare each pair and storing the results in a list.

    import jellyfish

    def sim_cal(string1, string2):
        similar = jellyfish.jaro_winkler(string1, string2)
        return similar
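Continuing from the snippet above, the full pairwise loop to build the matrix then looks roughly like this:

    import numpy as np

    texts = df['text'].tolist()
    n = len(texts)
    scores = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            scores[i, j] = scores[j, i] = sim_cal(texts[i], texts[j])

    # index of the most similar other string for each row
    masked = scores.copy()
    np.fill_diagonal(masked, -1)
    most_similar = masked.argmax(axis=1)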

But if the data becomes large, this gets very slow, so is there any way to speed it up?

Thanks.

Matrix similarity

Let $ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ and let $ B$ be the matrix you get by permuting the first and second column of $ A$ , i.e. $ B = \begin{bmatrix} b & a \\ d & c \end{bmatrix}$ .

Are $ A$ and $ B$ similar?

I know that by multiplying $ A$ by the elementary matrix for column permutation, you get $ A \times \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} = B$ , but I want to know whether you can find $ P \in GL_2(\mathbb R)$ such that $ B = P^{-1}AP$ .
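For comparison, conjugating by that same elementary matrix permutes the rows as well as the columns, so it does not produce $ B$ in general:
$$\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}^{-1}\begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}=\begin{bmatrix} d & c \\ b & a \end{bmatrix},$$
which is why multiplying by the permutation matrix on one side only does not settle the question.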

Similarity of images with tree-shaped drawn objects

I’m trying to find the best algorithm for my task. I have images looking like this one:

[example image: a drawing made of tree-shaped objects]

I want to find the most similar ones from a set of such images. I tried several algorithms to compute the similarity value (SIFT – opencv2, structural similarity – skimage.measure.compare_ssim, pixel similarity, wasserstein_distance, …). None of them seems to work as I’d like. For example, SIFT returned almost maximal similarity for the above image and this one: [second example image]

Which is far from perfect. Are there any algorithms which could perform better on such a task?
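For reference, the structural-similarity comparison I mentioned is along these lines (OpenCV + scikit-image; in recent versions the function is skimage.metrics.structural_similarity rather than skimage.measure.compare_ssim, and the file names are placeholders):

    import cv2
    from skimage.metrics import structural_similarity

    img1 = cv2.imread("drawing1.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("drawing2.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.resize(img2, (img1.shape[1], img1.shape[0]))  # SSIM needs equal-sized inputs

    score = structural_similarity(img1, img2)   # 1.0 means structurally identical
    print(score)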