Finding the similarity between large text files

My first question is: does an algorithm already exist for this? If not, any thoughts and ideas are appreciated.

Let’s say I have two large text files (original file A and new file B). Each file is English prose text (including dialogue) with a typical size of 256K to 500K characters.

I want to compare them to find out how similar the contents of B are to A.

Similar in this case means: all, or a significant part, of B exists in A, with the condition that there may be subtle differences, words changed here and there, or even globally.

In all cases we have to remember that this is looking for similarity, not (necessarily) identity.

Preprocessing for the text (a code sketch follows the list):

  1. Remove all punctuation (and close up gaps “didn’t” -> “didnt”);
  2. Lowercase everything;
  3. Remove common words;
  4. Reduce all whitespace to single space only, but keep paragraphs;
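A minimal sketch of those four steps in Python (the stop-word set here is a tiny placeholder; a real list would be much longer):

    import re

    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "that"}  # placeholder set

    def preprocess(text):
        paragraphs = []
        for para in re.split(r"\n\s*\n", text):    # split on blank lines to keep paragraphs
            para = re.sub(r"[^\w\s]", "", para)    # 1. strip punctuation ("didn't" -> "didnt")
            para = para.lower()                    # 2. lowercase everything
            words = [w for w in para.split() if w not in STOP_WORDS]  # 3. drop common words
            paragraphs.append(" ".join(words))     # 4. collapse whitespace to single spaces
        return paragraphs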

Other possible optimisations to reduce future workload:

Ignore any paragraph of less than a certain length. Why? Because there’s a higher probability of natural duplication in shorter paragraphs (though arguably not in the same overall position).

Have an arbitrary cut-off length on the paragraphs. Why? Mostly because it reduces workload.

Finally:

For every word, turn it into its Metaphone code. So instead of every paragraph being composed of normal words, it becomes a list of Metaphone codes, which helps in comparing slightly modified words.

We end up with paragraphs that look like this (each of these lines is a separate paragraph):

WNT TR0 ABT E0L JRTN TTKTF INSPK WLMS E0L UTRL OBNKS JRL TM RL SRPRS LKT TRKTL KM N SX WRT LT ASK W0R RT WRKS T ST N WLTNT RT 0M AL 
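The encoding step, with the paragraph-length filters from the optimisations above folded in, might look like this (jellyfish is one Python library that provides a metaphone() function; the thresholds are arbitrary examples):

    import jellyfish  # one library offering metaphone(); others exist

    MIN_PARA_LEN = 50     # ignore short paragraphs (higher chance of natural duplication)
    MAX_PARA_LEN = 2000   # arbitrary cut-off to reduce workload

    def encode(paragraphs):
        encoded = []
        for para in paragraphs:
            if len(para) < MIN_PARA_LEN:
                continue
            para = para[:MAX_PARA_LEN]
            encoded.append(" ".join(jellyfish.metaphone(w) for w in para.split()))
        return encoded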

But I admit that when it comes to the comparison I’m not sure how to approach it, beyond brute force: take the first encoded paragraph from B (B[0]) and check every paragraph in A, looking for a high match (maybe identical, maybe very similar). Perhaps we use Levenshtein distance to compute a match percentage between paragraphs.

If we find a match at A[n], then check B[1] against A[n+1], and maybe a couple further (A[n+2] and A[n+3]) in case something was inserted.

And proceed that way.
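A sketch of that brute-force walk, using difflib's ratio() from the standard library as a stand-in for a Levenshtein-style match percentage (the threshold and look-ahead values are guesses to tune):

    from difflib import SequenceMatcher

    THRESHOLD = 0.85   # arbitrary similarity cut-off
    LOOKAHEAD = 3      # after a match at A[n], try A[n+1] .. A[n+3]

    def similarity(p, q):
        return SequenceMatcher(None, p, q).ratio()

    def align(a_paras, b_paras):
        matches = []
        n = None                              # index in A of the last match, if any
        for b_idx, b_para in enumerate(b_paras):
            if n is None:                     # no anchor yet: scan all of A
                candidates = range(len(a_paras))
            else:                             # anchored: look just past the last match
                candidates = range(n + 1, min(n + 1 + LOOKAHEAD, len(a_paras)))
            best = max(candidates, key=lambda i: similarity(b_para, a_paras[i]), default=None)
            if best is not None and similarity(b_para, a_paras[best]) >= THRESHOLD:
                matches.append((b_idx, best))
                n = best
            else:
                n = None                      # lost the thread; rescan from scratch next time
        return matches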

What should be detected:

  • Near-identical text
  • Global proper noun changes
  • B is a subset of A

Thanks.

How to load large arrays to gpu and render with OpenGL?

I am trying to make a volumetric rendering of a cloud. I have been defining the cloud density functions in the GLSL shaders and performing ray marching successfully. But now I would like to render a 3D grid (100x100x100) representing the density of a cloud that I calculated on the CPU. My idea was to use a shader storage buffer object (SSBO), but when I access the array to get the density value and render it, it doesn’t work.

This is at the beginning of the GLSL fragment shader:

#version 440 core

layout(std430, binding = 3) buffer layoutName {
    float data_SSBO[100*100*100];
};

And the density function definition is:

float density(vec3 position, float t) {
    const float dx = 1./100., dy = 1./100., dz = 1./100.;
    int i, j, k;
    if ((position.x >= 0.) && (position.y >= 0.) && (position.z >= 0.) &&
        (position.x <= 1.) && (position.y <= 1.) && (position.z <= 1.)) {
        i = int(position.x/dx);
        j = int(position.y/dy);
        k = int(position.z/dz);
        return data_SSBO[i*100*100 + j*100 + k];
    }
    else
        return 0.;
}

And in the C code there is the buffer creation, binding, etc.:

    glGenBuffers(1, &ssbo);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    glBufferData(GL_SHADER_STORAGE_BUFFER, 100*100*100*sizeof(float), grid, GL_STATIC_DRAW);
    glBufferSubData(GL_ARRAY_BUFFER, 0, 100*100*100*sizeof(float), grid);

and in the rendering function there is:

    glClearColor(1.f, 1.f, 0.f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    glBindBuffer(GL_ARRAY_BUFFER, VBO);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, ssbo);
    // glBufferSubData(GL_ARRAY_BUFFER, 0, 3*2*2*sizeof(float), buffer);

    glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 2*sizeof(float), (void*)0);  // coordinates
    glEnableVertexAttribArray(0);

    glUseProgram(shaderProgram);

I believe the problem has to do with the binding. I have been trying different combinations, like binding before and after glUseProgram, etc., but I literally have no idea what is wrong; I find this really confusing.

How to check rapidly if an element is present in a large set of data

I am trying to harvest scientific publication data from different online sources like Core, PMC, arXiv, etc. From these sources I keep the metadata of the articles (title, authors, abstract, etc.) and the full text (only from the sources that provide it).

However, I don’t want to harvest the same article’s data from different sources. That is, I want to create a mechanism that will tell me whether an article I am about to harvest is already present in the dataset of articles I have harvested.

The first thing I tried was to check whether the article (which I want to harvest) has a DOI, and to search the collection of metadata (that I already harvested) for that DOI. If it is found there, then this article was already harvested. This approach, though, is very time-expensive, given that I have to do a serial search through the metadata of ~10 million articles (in XML format), and the time increases much more for articles that don’t have a DOI, where I have to compare other metadata (like title, authors and date of publication).

from os import listdir
import xml.etree.ElementTree as ET

def core_pmc_sim(core_article):
    if core_article.doi is not None:  # if the Core article has a DOI
        for xml_file in listdir('path_of_the_metadata_files'):  # parse all PMC XML metadata files
            # iterate through every tag in the XML
            for event, elem in ET.iterparse('path_of_the_metadata_files' + xml_file):
                if elem.tag == 'hasDOI':
                    print(xml_file, elem.text, core_article.doi)
                    if elem.text == core_article.doi:  # same DOI means the articles are the same
                        return True
                elem.clear()
    return False
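For comparison, one direction I am considering is to parse all the metadata once up front and keep every DOI in an in-memory set, so each later check is a constant-time lookup instead of a full re-parse (a sketch; the path and tag name match the code above):

    from os import listdir
    import xml.etree.ElementTree as ET

    def build_doi_index(metadata_dir):
        dois = set()
        for xml_file in listdir(metadata_dir):   # one pass over all metadata files
            for event, elem in ET.iterparse(metadata_dir + xml_file):
                if elem.tag == 'hasDOI' and elem.text:
                    dois.add(elem.text)
                elem.clear()
        return dois

    doi_index = build_doi_index('path_of_the_metadata_files')   # built once
    already_harvested = core_article.doi in doi_index           # O(1) per article

A Bloom filter would shrink the memory footprint further, at the cost of occasional false positives (an article wrongly treated as already harvested), so an exact set seems safer as long as it fits in RAM.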

What is the most rapid and memory-efficient way to achieve this?

(Would a Bloom filter be a good approach for this problem?)

mysql – importing large Tablespace: Lost connection during query

I am trying to recover an InnoDB table with 1.5M rows from an .ibd file (5.5 GB).

These are the exact steps I follow:

  1. Get the CREATE TABLE statement using the mysqlfrm command

  2. Create the table

  3. ALTER TABLE ... DISCARD TABLESPACE;

  4. Move the new tablespace file into the database directory

  5. ALTER TABLE ... IMPORT TABLESPACE;

After about 5 minutes I get this error:

ERROR 2013 (HY000): Lost connection to MySQL server during query

My my.cnf:

[client]
port=3307

[mysql]
no-beep

[mysqld]
max_allowed_packet=8M
innodb_buffer_pool_size=511M
innodb_log_file_size=500M
innodb_log_buffer_size=800M
net_read_timeout=600
net_write_timeout=600
open_files_limit=100000
skip-grant-tables
port=3307
datadir=D:\dbrecover\_home_db_\home\db
default-storage-engine=INNODB
sql-mode="STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION"
log-output=FILE
general-log=0
general_log_file="WIN-36LFCDISVVC.log"
slow-query-log=1
slow_query_log_file="WIN-36LFCDISVVC-slow.log"
long_query_time=10
log-error="WIN-36LFCDISVVC.err"
relay_log="WIN-36LFCDISVVC-relay"
server-id=1
report_port=3307
lower_case_table_names=2
secure-file-priv="C:/ProgramData/MySQL/MySQL Server 5.7/Uploads"
max_connections=151
table_open_cache=2000
tmp_table_size=123M
thread_cache_size=10
myisam_max_sort_file_size=100G
myisam_sort_buffer_size=236M
key_buffer_size=8M
read_buffer_size=64K
read_rnd_buffer_size=256K
innodb_flush_log_at_trx_commit=1
innodb_thread_concurrency=9
innodb_autoextend_increment=64
innodb_buffer_pool_instances=8
innodb_concurrency_tickets=5000
innodb_old_blocks_time=1000
innodb_open_files=300
innodb_stats_on_metadata=0
innodb_file_per_table=1
innodb_checksum_algorithm=0
back_log=80
flush_time=0
join_buffer_size=256K
max_connect_errors=100
sort_buffer_size=256K
table_definition_cache=1400
binlog_row_event_max_size=8K
sync_master_info=10000
sync_relay_log=10000
sync_relay_log_info=10000

Is there any way to import it?

Go up two sizes: How do I RAW get a Halfling from size Small to size Large?

How do I turn a size Small Halfling into a size Large combatant? I am a GM and the PCs are Level Four. I want to throw the PCs a curve they are not expecting while keeping it “mid level” (not Mythic nor Epic). This is for an NPC.

The Enlarge Person spell says:

This spell causes instant growth of a humanoid creature, doubling its height and multiplying its weight by 8. This increase changes the creature’s size category to the next larger one.

Enlarge Person also says:

Multiple magical effects that increase size do not stack.

The Polymorph spell says it functions like Alter Self. Alter Self says:

When you cast this spell, you can assume the form of any Small or Medium creature of the humanoid type.

Does that mean RAW my Halfling can somehow take a potion of alter self to become size M, then a potion of Enlarge Person to become size L? I suspect this is not intended to work.

While I would like the NPC to be a humanoid, it could be another creature type.

What are the options to go up two sizes, RAW, without getting into spells of levels 7 to 9?

(Edit: I thought Duergar had this ability, but they start at size M not size S.)


What are alternatives to a large drop down select list?

I am adding a single-selection drop-down containing a list of all countries (around 200+ items).

On Windows it works fine: it displays 10-15 countries at a time and a scroll bar shows up. On the Mac, the drop-down shows all the items at once, and users have to scroll through the whole list to find their desired item.

I am not using a text field, as it may require spell checking.

What are alternatives to a large drop-down select list that will work similarly across computers?

Algorithm for efficiently sorting large lists based on user preference

I’ll preface this question by saying I’m having a difficult time even formulating the problem, so my explanation might be fuzzy and/or I might be missing obvious solutions.

I have a list of 479 books which I would like to sort based on a fuzzy criterion such as “which books would I like to read before the others in this list?”.

I took a stab at solving this by storing a record for each book in a database and pre-populating a rank column with a unique sequential number from 1 to 479. For any particular rank, I’d rather read the corresponding book than any book with a higher rank number; the closer the rank is to 1, the earlier I wish to read the book.

I created an interface that presents me with a choice between two books selected randomly from the database. After I click the book I would rather read first, the following happens:

  • If the rank of the selected book is already lower (more interesting) than the other’s, I don’t change the rank of either book;
  • If the selected book’s rank is higher (less interesting) than the other’s, I change the selected book’s rank to match the other book’s, and add 1 to the rank of every book whose rank is greater than or equal to that (including the other book, which ends up ranked directly below the selected book).

Finally, for each book I also store a counter of the times it has been evaluated. After I make a selection between two books, this counter increases for both the books that were presented to me. This allows me to avoid presenting books that have already been evaluated a certain number of times until all other books have been evaluated the same number of times.
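In code, the update rule I implemented looks roughly like this (a sketch using plain dicts; in my case the data lives in database columns):

    def record_choice(ranks, counts, selected, other):
        # ranks maps book -> rank (1 = read first); counts maps book -> times evaluated
        if ranks[selected] > ranks[other]:      # selected was ranked below the other book
            new_rank = ranks[other]
            for book, rank in ranks.items():
                if rank >= new_rank and book != selected:
                    ranks[book] = rank + 1      # shift the other book and everything after it down
            ranks[selected] = new_rank          # selected takes over the other book's old rank
        # if selected already outranked the other book, both ranks stay as they are
        counts[selected] += 1
        counts[other] += 1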

I found the algorithm to be utterly ineffective: after going through all 479 books once, I looked at the list sorted by rank, and the order did not reflect my own perception of how I’d prioritize these books at all.

I’m looking for an algorithm that:

  • Allows me to organize the list in an order that I would perceive to be accurate based on my personal notion of which books I’d like to read first;
  • Can prioritize the aforementioned list with as little effort required as possible (i.e. an algorithm that requires the user to compare every book with every other book in the list in order to come to a valid sorting order isn’t ideal).