JSONB Array of Strings (with GIN index) versus Split Rows (B-Tree Index)

I have a database table with a receiver column that indicates which account each row of data relates to. This has led to a great deal of duplication: one set of data may produce three otherwise-identical rows whose only difference is the receiver column.

| Receiver | Event | Date | Location |
|----------|-------|------|----------|
| Alpha    | 3     | 12   | USA      |
| Bravo    | 3     | 12   | USA      |
| Charlie  | 3     | 12   | USA      |

While redesigning the database, I have considered using an array with a GIN index instead of the current B-Tree index on receiver. My proposed new table would look like this:

| Receivers                     | Event | Date | Location |
|-------------------------------|-------|------|----------|
| ["Alpha", "Bravo", "Charlie"] | 3     | 12   | USA      |

More Information:

  • Receiver names use only the characters a-z, 1-5, and . (dot)
  • 95% of all queries currently look like this: SELECT * FROM table WHERE receiver = 'Alpha'; with the new format this would become SELECT * FROM table WHERE receivers @> '"Alpha"'::jsonb;
  • The table currently contains over 4 billion rows (with duplication) and the new proposed schema would cut it down to under 2 billion rows.

Questions:

  1. Does it make more sense to use a native Postgres text array (text[]) instead?
  2. Would a jsonb_path_ops GIN index on receivers make sense here?
  3. Which option is more efficient? Which is faster? (A sketch of both variants follows.)
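For reference, a minimal sketch of the two variants I'm weighing (table, column, and index names below are placeholders, not my real schema):

    -- Option A: jsonb array + GIN index with the jsonb_path_ops operator class
    CREATE TABLE events_jsonb (
        receivers jsonb,    -- e.g. '["Alpha", "Bravo", "Charlie"]'
        event     int,
        date      int,
        location  text
    );
    CREATE INDEX events_jsonb_receivers_idx
        ON events_jsonb USING GIN (receivers jsonb_path_ops);
    -- the 95% query: SELECT * FROM events_jsonb WHERE receivers @> '"Alpha"'::jsonb;

    -- Option B: native text[] + plain GIN index (default array operator class)
    CREATE TABLE events_arr (
        receivers text[],   -- e.g. '{Alpha, Bravo, Charlie}'
        event     int,
        date      int,
        location  text
    );
    CREATE INDEX events_arr_receivers_idx
        ON events_arr USING GIN (receivers);
    -- the 95% query: SELECT * FROM events_arr WHERE receivers @> ARRAY['Alpha'];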

Induction on strings (words)

Given an alphabet Σ = {0, 1, 2} and a function cross that calculates the cross sum (digit sum) of a word.

cross : Σ* → ℕ with:

    cross(w) = 0               if w = ε
    cross(w) = cross(v) + x    if w = vx with x ∈ Σ

Prove by induction on words that ∀w ∈ Σ*: cross(w) ≤ 2·|w|.

I can prove that the statement holds for ε

Base case: cross(ε) = 0 ≤ 2·|ε| = 0

How can I show that the statement holds in the inductive step?
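For reference, here is the shape the inductive step needs to take (a sketch in LaTeX; the key fact is that every symbol x ∈ Σ = {0, 1, 2} satisfies x ≤ 2):

    % Inductive step (sketch). Hypothesis: cross(v) <= 2|v| for some v in Σ*.
    % Show the claim for w = vx with x ∈ Σ:
    \begin{align*}
    \operatorname{cross}(vx) &= \operatorname{cross}(v) + x && \text{(definition of cross)}\\
      &\leq 2\,\lvert v\rvert + x && \text{(induction hypothesis)}\\
      &\leq 2\,\lvert v\rvert + 2 && (x \in \{0,1,2\}\text{, so } x \leq 2)\\
      &= 2\,(\lvert v\rvert + 1) = 2\,\lvert vx\rvert
    \end{align*}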

How to make Rules ignore case when comparing strings

I have an issue where I want to use the Text Comparison condition in the Drupal Rules module, but sometimes the case doesn't match (one value is provided by user input and the other comes from user data). It's comparing 2 emails.

The rule fires as expected if the text matches exactly, but if the text differs only in case (e.g. example@example.com vs. Example@example.com) it doesn't fire.

Is there a way to get Rules to ignore case?

The data that I'm comparing in the text comparison is as follows:
user-data:mail
person-email

The user-data:mail is the email address of the user who is submitting the webform; person-email is the value the user entered on the webform, passed into the rule set as provided data.

Numpy Array of Strings

I am not sure that this question is suitable here, but since it is about a data science Python module, namely numpy, I decided to post it.

In Python, an empty string consumes 49 bytes, because the str object carries some internal fields even when it contains no characters. This can be tested like this:

    import sys

    str1 = "Hello"
    str2 = ""

    print(sys.getsizeof(str2))  # 49
    print(sys.getsizeof(str1))  # 54

This code also shows that 1 byte is allocated per character (54 − 49 = 5 bytes for the 5 characters of "Hello").

After that, I define a numpy array of two strings, and I check the size of an element with this test code:

    import numpy

    np_arr = numpy.array(["Hello", "George"])

    print(np_arr.dtype.itemsize)  # 24

For string elements, numpy uses a Unicode data type whose width is set by the longest string, and that same fixed-size layout is used for every string in the array. Here the longer string is "George" (6 characters), so itemsize is returned as 24. If we use the string "Destructive" (11 characters), itemsize will be 44. This means that 4 bytes are allocated for each character.

Is there anyone who can explain the reason? Why is 1 byte allocated per character in a standard string, but 4 bytes per character in a numpy string?
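For reference, printing the dtype itself makes the fixed-width layout visible (a quick check using the same example):

    import numpy

    np_arr = numpy.array(["Hello", "George"])

    # numpy stores each element as a fixed-width UCS-4 string: "<U6" means
    # little-endian Unicode with room for up to 6 code points per element.
    print(np_arr.dtype)           # <U6
    print(np_arr.dtype.itemsize)  # 24 = 6 code points x 4 bytes each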

How can I include common strings (regexes) in several projects written in different languages?

I have a simple Go library (~300 lines, mostly type declarations plus convenience methods, compatibility methods for gomobile, etc., and pretty-printing scripts). The heart of the whole project is two regexes, placed in a separate file in my Go repository.

Now I want to port that simple library to Python, Java (Android), and maybe JavaScript. I want the regexes to live in a single Git repository (maybe even the Go one).

What is the best way to include the regexes at compile/build time as string constants? In the case of Go, it seems that the only way is go generate, which generates source code files and could pull those regexes in from a plain-text file. So I decided to keep the regexes in a separate Go source file that other build tools (for Python, Java, JavaScript) can easily parse. I would appreciate any advice on how to do that efficiently; a rough sketch of what I mean follows below. I could also make a separate repository for the two regexes and use git submodules (or even treat it as a separate Go package in the Go project).
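For concreteness, this is roughly the kind of generation step I have in mind, sketched here in Python (the file names regexes.txt and patterns.py are made up for illustration):

    # gen_patterns.py -- hypothetical build step: read a shared plain-text
    # file (one regex per line, '#' marks a comment) and emit a Python module.
    from pathlib import Path

    PATTERNS_FILE = Path("regexes.txt")  # the shared file, e.g. a git submodule
    OUTPUT_FILE = Path("patterns.py")    # generated; never edited by hand

    def main() -> None:
        lines = PATTERNS_FILE.read_text(encoding="utf-8").splitlines()
        regexes = [ln for ln in lines if ln and not ln.startswith("#")]
        body = "\n".join(f"REGEX_{i} = {rx!r}" for i, rx in enumerate(regexes))
        OUTPUT_FILE.write_text("# Generated file, do not edit.\n" + body + "\n",
                               encoding="utf-8")

    if __name__ == "__main__":
        main()

An analogous go generate step (or a Gradle task on the Java side) could consume the same regexes.txt, so the patterns would live in exactly one place.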

But I've lost my hope for a pretty solution, and that's why I'm asking here and not on SO: how do you solve such problems? The simplest way would be to manually "cherry-pick" commits from each repository, setting the author field manually.

Function to convert strings from hh:mm:ss, mm:ss, ss format to milliseconds

Here's the current logic to convert strings in the format hh:mm:ss, mm:ss, or ss to milliseconds (for example, "1:02:03" should yield 3723000).

Any comments on how to improve this?

    hhmmssToMillis(hhmmss) {
        // Split on ':' and reverse so index 0 is seconds, 1 is minutes, 2 is hours.
        let time = hhmmss.split(':').reverse();
        let millis = 0;
        switch (time.length) {
            case 1: // "ss"
                millis = parseInt(time[0], 10) * 1000;
                break;
            case 2: // "mm:ss"
                millis = (parseInt(time[1], 10) * 60 + parseInt(time[0], 10)) * 1000;
                break;
            case 3: // "hh:mm:ss"
                millis = (parseInt(time[2], 10) * 60 * 60 + parseInt(time[1], 10) * 60 + parseInt(time[0], 10)) * 1000;
                break;
        }
        return millis;
    }

Finding different char from 2 given strings

I believe I already have the code right for this particular question, but I do have some follow-up questions. I'm still fairly new to this, and I got it to spit out what I was looking for. If there is room for improvement, I am open to criticism; I'm just looking to get better.

How would I approach this differently, or how would I solve it, if the 2 given Strings are VERY large and memory is limited? (One idea I've been toying with is sketched below the code.)

    import java.util.ArrayList;

    public class DifferChar {

        public char diffChar(String str1, String str2) {
            ArrayList<Character> al1 = new ArrayList<Character>();
            ArrayList<Character> al2 = new ArrayList<Character>();
            String longer;
            String shorter;
            char c = '\u0000';

            if (str1 != null && str2 != null) {
                // Work out which string is longer; compare case-insensitively.
                if (str1.length() > str2.length()) {
                    longer = str1.toUpperCase();
                    shorter = str2.toUpperCase();
                } else {
                    longer = str2.toUpperCase();
                    shorter = str1.toUpperCase();
                }

                if (longer.length() - shorter.length() <= 1) {
                    for (char ch : shorter.toCharArray())
                        al1.add(ch);
                    for (char ch : longer.toCharArray())
                        al2.add(ch);
                    // Remove each char of the shorter string from the longer one;
                    // whatever single char is left over is the difference.
                    for (int i = al1.size() - 1; i >= 0; i--) {
                        if (al2.contains(al1.get(i)))
                            al2.remove(al1.get(i));
                    }
                    if (al2.size() == 1)
                        c = al2.get(0);
                }
            }
            return c;
        }

        public static void main(String[] args) {
            String str1 = "aklwejr";
            String str2 = "aklwej";

            DifferChar diff = new DifferChar();
            System.out.println(diff.diffChar(str1, str2));
        }
    }
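The idea for the very-large-strings case, sketched in Python for brevity (untested, and it assumes the longer string is the shorter one plus exactly one extra character): stream both strings and XOR the character codes, so memory use stays constant no matter how long the inputs are.

    # Hypothetical constant-memory variant: every character that occurs in
    # both strings cancels out under XOR, leaving only the extra character.
    def diff_char(shorter: str, longer: str) -> str:
        acc = 0
        for ch in shorter:
            acc ^= ord(ch.upper())  # upper-case to keep it case-insensitive
        for ch in longer:
            acc ^= ord(ch.upper())
        return chr(acc)

    print(diff_char("aklwej", "aklwejr"))  # R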