Efficiently computing lower bounds over partially ordered sets

I have a list of sets that I would like to sort into a partial order based on the subset relation.

In fact, I do not require the complete ordering, only the lower bounds.

If I am not mistaken, each lower bound should define one separate component of the respective graph – and this component should be a meet-semilattice.

What would be the most convenient, space- and time-efficient way to solve this problem? Perhaps there is a way that does not require building the entire graph? Perhaps there is a known algorithm under better terminology than what I have naively described above?

I am aware that the time and space requirements are underspecified above, but I would be happy about any suggestions, whether they are proven to be optimal or not…

Background: I am currently building an entire graph database that holds all the edges between the sets and then looking for the nodes that have no generalizations, but this is quite complicated, slow, and requires a lot of (disk) space. The list mentioned above contains roughly 100 million sets.
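
If “lower bounds” here means the minimal sets under the subset relation (those with no proper subset anywhere in the list, i.e. the nodes with no generalizations), the naive baseline is the quadratic pairwise check sketched below. It is only meant to pin down the computation; at roughly 100 million sets it is far too slow, and the function name and the use of frozenset are my own choices.

    def minimal_sets(sets):
        """Naive O(n^2) search for the minimal elements under the subset relation.

        A set is minimal (a "lower bound" in the question's sense) when no other
        set in the input is a proper subset of it.
        """
        sets = [frozenset(s) for s in sets]
        return [s for s in sets
                if not any(t < s for t in sets)]  # t < s: t is a proper subset of s

    # Example: {1} and {2, 3} are minimal; {1, 2} and {1, 2, 3} are not.
    print(minimal_sets([{1}, {1, 2}, {2, 3}, {1, 2, 3}]))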

A data structure for efficiently comparing the positions of nodes relative to each other (e.g., whether one node is earlier in a list than another node)

Suppose we have $ N$ elements, which we’ll treat as simple objects. Is there a data structure I can use that will allow me to see which node appears earlier, based on some arbitrary insertion order, given a reference to both nodes? I’ll need some kind of data structure $ \mathcal{D}$ , where $ \mathcal{D}$ supports the following operations (assuming $ A$ and $ B$ are node references, and $ A \neq B$ ):

$ \mathcal{D}.addBefore(A, B)$

$ \mathcal{D}.addAfter(A, B)$

$ \mathcal{D}.addLast(A)$

$ \mathcal{D}.earlier(A, B) \rightarrow bool$

$ \mathcal{D}.remove(A)$

I’d like to implement some kind of $ Earlier$ predicate which takes two nodes and returns whether A comes before B. For example, if this was an indexed list and we had the index of the nodes, then it’d be simply:

$$ Earlier(A, B) \implies A.index < B.index $$

The ordering is determined by a user who inserts elements as they see fit. They are allowed to add an element either after some node or before some node; if the data structure is empty, they can simply add it, and the element that was added first remains the only element in the data structure until another element is added.

A practical example of this problem: a user is pasting files into a directory, but the file explorer lets the user paste files before or after any file in the list. The file explorer must display the files in the order the user requests, so if a list is used to hold the files, then [A, B, C] should render as [A, B, C], and if the user pastes a file D before B, then the list should render as [A, D, B, C].

This becomes a problem when I need to insert before another item: an array-backed list doesn’t give me that luxury, since inserting into the middle of it has a big overhead. My next thought was to go with a linked list, because I will have references to the two nodes and can insert quickly given my handle to the node. This is $ \mathcal{O}(1)$ for insertion.

The actual problem I have is that insertions are not too frequent, but checking which of two given nodes comes first in the data structure is a common operation. This makes the naive $ \mathcal{O}(n)$ search pretty painful when dealing with a lot of nodes, as in the worst case I have to scan all the other nodes in the list to determine which one is behind/ahead of the other.

My main roadblock is that since the user can insert them in any order (and it needs to stay in the order the user inserts them in), I have to use some data structure that maintains this invariant.

As such, with a linked list I am stuck currently at:

$$ Earlier \in \mathcal{O}(n) $$

and iterating over the list is of course $ \mathcal{O}(n)$ , along with removal being $ \mathcal{O}(1)$ since it’s trivial to unlink a node with a reference to it in a doubly linked list.


My solution to the problem:

Now, we can change the data structure if we want, so a linked list isn’t required. The only thing that is required is the ability to let the users iterate over the data structure and get the elements back in the order they placed them in.

This makes me wonder whether there’s a tree structure I can use: for example, a binary tree that balances itself so that the search depth stays around $ \mathcal{O}(\lg n)$ . The first thing that jumps to mind is an AVL tree, where I’d track the sizes of the subtrees for balance and update them. This isn’t quite an AVL tree, since there’s no implicit ordering between the nodes, but the idea of self-balancing is something I’d like to exploit to get a good search runtime.

To make this viable, our users will have references to our nodes. This way we can put each node in a hash table and do an $ \mathcal{O}(1)$ lookup to find the node in the tree. Inserting a node before or after it is not too bad: we replace the existing leaf with a new parent node whose two children are the existing leaf and the new element. To make this visually clear:

        o                     o
       / \     add A         / \      rebalance
      o   o   -------->     o   o    --------->   ...
     / \      before X     / \       if needed
    o   X                 o   o
                             / \
                            A   X

or

        o                     o
       / \     add A         / \      rebalance
      o   o   -------->     o   o    --------->   ...
     / \      after X      / \       if needed
    o   X                 o   o
                             / \
                            X   A

where o is another node (that is either a parent, or a leaf).

A consequence of this data structure is that it is a full binary tree and each leaf is a value we’re storing, and the parent nodes do not store any value.

The cost of adding a node to this self-balancing tree is $ \mathcal{O}(1)$ to find the reference node, since we assume we can look up the node reference from a hash table, then $ \mathcal{O}(1)$ to insert it by adding a parent and attaching the two leaves, and then $ \mathcal{O}(\lg n)$ to rebalance the tree. This means insertion is $ \mathcal{O}(\lg n)$ . Not too bad.

Determining which of two elements comes earlier becomes “traverse from both nodes up to the root, find the lowest common ancestor, and whichever node descends from the ancestor’s left branch is earlier”, which is $ \mathcal{O}(\lg n)$ . Searching is now logarithmic as well.

As such, this means we now get:

$$ Earlier \in \mathcal{O}(\lg n) $$

Further, iterating over the binary tree is $ \mathcal{O}(n)$ since it’s a full binary tree and at worst there should be approximately $ 2n$ nodes to visit in total. Since the naive list solution previously was $ \mathcal{O}(n)$ , we’re looking good.

Finally, removal is probably the same as AVL tree removal and thus also $ \mathcal{O}(\lg n)$ .
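
To make the structure concrete, here is a minimal Python sketch of this leaf-tree idea: parent pointers, a hash table from values to leaf nodes, and the LCA-based comparison. It deliberately omits rebalancing (so the logarithmic bounds only hold if the tree happens to stay balanced), its addLast walks the right spine rather than being constant time, and it assumes stored values are hashable and not None. The class and method names are my own.

    class Node:
        """Leaves hold values; internal nodes (value is None) only glue the tree together."""
        __slots__ = ("parent", "left", "right", "value")

        def __init__(self, value=None):
            self.parent = self.left = self.right = None
            self.value = value


    class OrderMaintainer:
        """Leaf tree without rebalancing; a real version would rebalance on updates."""

        def __init__(self):
            self.root = None
            self.leaf = {}  # value -> leaf node, the O(1) hash-table lookup

        def add_last(self, value):
            new = Node(value)
            self.leaf[value] = new
            if self.root is None:
                self.root = new
                return
            rightmost = self.root
            while rightmost.value is None:      # walk down to the last leaf
                rightmost = rightmost.right
            self._splice(rightmost, new, new_on_left=False)

        def add_before(self, value, ref):
            new = Node(value)
            self.leaf[value] = new
            self._splice(self.leaf[ref], new, new_on_left=True)

        def add_after(self, value, ref):
            new = Node(value)
            self.leaf[value] = new
            self._splice(self.leaf[ref], new, new_on_left=False)

        def _splice(self, ref_leaf, new_leaf, new_on_left):
            # Replace ref_leaf by a fresh internal node whose children are new_leaf and ref_leaf.
            parent, grand = Node(), ref_leaf.parent
            parent.left, parent.right = (new_leaf, ref_leaf) if new_on_left else (ref_leaf, new_leaf)
            parent.left.parent = parent.right.parent = parent
            parent.parent = grand
            if grand is None:
                self.root = parent
            elif grand.left is ref_leaf:
                grand.left = parent
            else:
                grand.right = parent

        def earlier(self, a, b):
            """True iff a comes before b: climb to the LCA and see which side a came from."""
            came_from = {}                      # ancestor -> child on a's path
            node = self.leaf[a]
            while node.parent is not None:
                came_from[node.parent] = node
                node = node.parent
            node = self.leaf[b]
            while node.parent not in came_from:
                node = node.parent
            lca = node.parent
            return came_from[lca] is lca.left   # a is earlier iff it descends from the left child

        def remove(self, value):
            leaf = self.leaf.pop(value)
            parent = leaf.parent
            if parent is None:                  # it was the only element
                self.root = None
                return
            sibling = parent.right if parent.left is leaf else parent.left
            grand = parent.parent
            sibling.parent = grand
            if grand is None:
                self.root = sibling
            elif grand.left is parent:
                grand.left = sibling
            else:
                grand.right = sibling


    d = OrderMaintainer()
    d.add_last("A"); d.add_after("B", "A"); d.add_after("C", "B")
    d.add_before("D", "B")        # order is now A, D, B, C
    print(d.earlier("D", "C"))    # True
    print(d.earlier("C", "A"))    # False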


But can we do better?

Overall the above solution is decent, but it would be really nice if I could knock the searching down to $ \mathcal{O}(1)$ , or something really small that feels effectively constant, like the $ \mathcal{O}(\alpha(n))$ bounds for some disjoint-set operations.

Is it possible to do something like this in $ o(\lg n)$ time? I am willing to trade away performance on addition, deletion, and iteration to get a better search time, as that is my bottleneck.

I don’t know what other data structures are out there, maybe you know. Or maybe there is some other method I can use that I don’t know about which would allow me to achieve very quick search times. I can augment the data structures, so that is always an option on the table.

I also understand that getting a better runtime might require going to the literature and implementing some exotic data structure, whose implementation and maintenance cost may be more than it’s worth; maybe the balancing binary tree is the only viable solution, since this is not Google-level data and doesn’t need such a thing. Since this is a problem I have in a hobby project, I figure I can try out things with little repercussion.

Efficiently extendable hash function?

I’m wondering whether there exist good known hash functions with the following property: assume that $ x$ is some string over some alphabet $ A$ ; then, given $ H(x)$ , we can compute in $ O(1)$ time both $ H(ax)$ and $ H(xa)$ for any letter $ a\in A$ . In practice one can assume that $ A$ is, e.g., the set of $ 8$ -bit integers.

In other words, I want a hash function for strings that can quickly be extended in both directions. Of course, I’m only interested in hash functions that actually distribute the data well and are very fast to compute in practice.
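
One family with this property (not necessarily the best choice, just an illustration) is a polynomial rolling hash modulo a prime: if the state also carries $ B^{|x|} \bmod P$ , both extensions cost a constant number of modular operations. A rough Python sketch, with the base and modulus picked arbitrarily:

    # Polynomial hash H(x) = sum_i x[i] * B^(len(x)-1-i)  (mod P).
    # Carrying pow_len = B^len(x) mod P alongside the hash makes both extensions O(1).
    P = (1 << 61) - 1      # a large (Mersenne) prime modulus; arbitrary choice
    B = 1_000_003          # base; in practice pick it at random to resist adversarial inputs

    def empty():
        return (0, 1)      # (hash value, B^length mod P) of the empty string

    def extend_right(h, a):            # H(x) -> H(xa)
        value, pow_len = h
        return ((value * B + a) % P, (pow_len * B) % P)

    def extend_left(h, a):             # H(x) -> H(ax)
        value, pow_len = h
        return ((a * pow_len + value) % P, (pow_len * B) % P)

    # Building "abc" left-to-right or right-to-left yields the same hash.
    h1 = empty()
    for byte in b"abc":
        h1 = extend_right(h1, byte)
    h2 = empty()
    for byte in reversed(b"abc"):
        h2 = extend_left(h2, byte)
    assert h1 == h2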

How to check whether given $k$ vertices make a $k$-clique in an undirected graph $G$ efficiently

Let $ G=(V, E)$ be an undirected graph with vertex set $ V$ and edge set $ E$ . Let $ V' = \{v_1, v_2, \dots, v_k\}$ be a subset of $ V$ in which each $ v_i$ has degree at least $ k$ . Is there a way to check whether $ V'$ is a $ k$ -clique that is more efficient than the brute-force $ O(k^2)$ algorithm? Thanks in advance.
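
For reference, the $ O(k^2)$ brute-force baseline being asked about looks like the following Python sketch; it assumes the graph is stored as a dict mapping each vertex to the set of its neighbours, so each pair test is an expected constant-time membership check.

    from itertools import combinations

    def is_clique(adj, vertices):
        """Brute force: every unordered pair of the given vertices must be an edge.

        adj maps each vertex to the set of its neighbours; with hash sets each
        membership test is expected O(1), so the whole check is O(k^2) pair tests.
        """
        return all(v in adj[u] for u, v in combinations(vertices, 2))

    # Example: triangle 1-2-3 plus a pendant vertex 4 attached to 3.
    adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(is_clique(adj, [1, 2, 3]))  # True
    print(is_clique(adj, [2, 3, 4]))  # False: 2 and 4 are not adjacent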

Deleting multiple items in a shopping cart efficiently

My boss is insistent we need to build functionality to allow users to delete multiple items out of their shopping cart at once (e.g., checkboxes, with a “Select All” button/link, and “Delete” at the top and bottom of pages).

Currently we have the far more typical “Delete” option at the item/product level.

I’ve researched dozens of competitor carts and not one offers delete-multiple-items-at-once functionality. Edit: AmazonSupply does this, but at the expense of line-item delete.

I’d love to be able to push back on this and prove it’s a bad user experience, but honestly the requestor is so adamant about it that it’s probably an unwinnable discussion.

Anyway, does anyone have examples of good implementations of this functionality? It seems like it’s going to add a lot of clutter and visual noise, and overall it’s an atypical implementation. Our site has some “Save for Later” functionality that multi-select might be beneficial for, but overall I’m concerned about implementing it as it doesn’t seem widely done.

So, if anyone has good examples of multi-select delete/save in shopping carts, or compelling articles on why that functionality shouldn’t be used (so I can try to beat the requirement back), any help would be appreciated. Thanks!

Efficiently populate a look-up table for a function over a range of arguments in Python

I am minimizing a scalar function which takes an n-dimensional vector input and outputs the scalar value, an n-dimensional vector (the Jacobian), and an n×n matrix (the Hessian). Given a range for the elements of my input vector, I’m looking for an efficient way to precalculate the outputs in an efficient-to-access format.

I’ve been thinking of a scheme based on scipy.interpolate.interpn with a regularly spaced grid of inputs, but this only allows for linear interpolation of intermediate values and requires regular sampling.
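
For reference, a minimal sketch of that grid-based scheme with SciPy’s RegularGridInterpolator (the function f and the grid resolution below are placeholders):

    import numpy as np
    from scipy.interpolate import RegularGridInterpolator

    def f(x, y):                      # placeholder for the expensive scalar function
        return np.sin(x) * np.cos(y)

    # Tabulate f once on a regular grid covering the known input ranges.
    xs = np.linspace(0.0, 1.0, 101)
    ys = np.linspace(0.0, 1.0, 101)
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    values = f(X, Y)

    # 'linear' is the default; recent SciPy versions also accept e.g. method='cubic'
    # here, which may help if higher-order behaviour matters.
    lookup = RegularGridInterpolator((xs, ys), values, method="linear")

    print(lookup([[0.25, 0.5]]))      # interpolated value at an off-grid point
    print(f(0.25, 0.5))               # exact value, for comparison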

I’m hoping there is a tool available that does this with a more intelligent method, perhaps with automatic refinement of the sampled inputs or a more sophisticated interpolation scheme?

Technically, the scalar function already determines the Jacobian and Hessian, but I need those with decent fidelity, so I would either need a higher-order representation of the scalar function (with increased sampling density) or I could interpolate the Jacobian and Hessian directly (as they are output by the minimizer anyway).

Thanks

Algorithm for efficiently sorting large lists based on user preference

I’ll preface this question by saying I’m having a difficult time even formulating the problem, so my explanation might be fuzzy and/or I might be missing obvious solutions.

I have a list of 479 books which I would like to sort based on a “Fuzzy” criterion such as “which books would I like to read before the others in this list?”.

I took a stab at solving this by storing a record for each book in a database, and pre-populating a rank column with a unique sequential number from 1 to 479. For any particular rank, I’d like to read the corresponding book more than a book with a higher rank number. If the rank number is closer to 1, the corresponding book is one I wish to read earlier.

I created an interface that presents me with a choice between two books selected randomly from the database. After I click the book I would rather read first, the following happens:

  • If the rank of the selected book is already lower (more interesting) than the other, I don’t change the rank of either book;
  • If the selected book has a rank that’s higher (less interesting) than the other, I change the selected book’s rank to match the other book’s rank, and add 1 to the rank of every other book whose rank is greater than or equal to that value (including the other book, which is then ranked directly below the selected book); a sketch of this update rule is shown below.

Finally, for each book I also store a counter of the times it has been evaluated. After I make a selection between two books, this counter increases for both the books that were presented to me. This allows me to avoid presenting books that have already been evaluated a certain number of times until all other books have been evaluated the same number of times.
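
For clarity, here is a small Python sketch of the update rule described in the bullets above, done in memory rather than against the database; the ranks dict and the function name are my own.

    def choose(ranks, selected, other):
        """Apply the update after the user prefers `selected` over `other`.

        ranks maps book id -> rank, where rank 1 means "read first". If the
        selected book already outranks the other, nothing changes; otherwise it
        takes the other book's rank, and every book at or below that rank
        (including the other book) is pushed down by one.
        """
        if ranks[selected] < ranks[other]:
            return
        new_rank = ranks[other]
        for book, rank in ranks.items():
            if book != selected and rank >= new_rank:
                ranks[book] = rank + 1
        ranks[selected] = new_rank

    ranks = {"A": 1, "B": 2, "C": 3, "D": 4}
    choose(ranks, "D", "B")       # the user prefers D over B
    print(ranks)                  # {'A': 1, 'B': 3, 'C': 4, 'D': 2}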

I found the algorithm to be utterly ineffective: after going through all 479 of the books once, I looked at the list sorted by rank and noticed the list does not reflect at all my own perception of how I’d prioritize these books.

I’m looking for an algorithm that:

  • Allows me to organize the list in an order that I would perceive to be accurate based on my personal notion of which books I’d like to read first;
  • Can prioritize the aforementioned list with as little effort required as possible (i.e. an algorithm that requires the user to compare every book with every other book in the list in order to come to a valid sorting order isn’t ideal).

How to get the oldest values over each id in a PostgreSQL table efficiently?

How can PostgreSQL return a list of the oldest timestamp values over a table of sensor id measurements?

Let me explain the situation with a sample table:

    CREATE TABLE sensor_data(
        sensor_id INTEGER,
        time TIMESTAMPTZ,
        value NUMERIC,
        PRIMARY KEY (sensor_id, time)
    )

Populated table example:

    +-----------+------------------+-------+
    | sensor_id |       time       | value |
    +-----------+------------------+-------+
    |         1 | 2018-01-01 00:00 |     1 |
    |         1 | 2018-01-01 01:00 |     2 |
    |         3 | 2018-01-01 03:00 |     4 |
    |         3 | 2018-01-01 04:00 |     3 |
    |         4 | 2018-01-01 03:00 |     5 |
    |         4 | 2018-01-01 04:00 |     6 |
    +-----------+------------------+-------+

When using something like sensor_id IN (1, 3) inside the query, I want it to return something like this:

    +-----------+------------------+-------+
    | sensor_id |       time       | value |
    +-----------+------------------+-------+
    |         1 | 2018-01-01 01:00 |     2 |
    |         3 | 2018-01-01 04:00 |     3 |
    +-----------+------------------+-------+

How can I do that in a query that uses the PRIMARY KEY index to speed things up?
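
One common PostgreSQL approach (a sketch, not a tested answer for this exact schema) is DISTINCT ON, which keeps the first row per sensor_id according to the ORDER BY and can walk the (sensor_id, time) primary-key index:

    -- One row per sensor; the ORDER BY decides which row survives.
    -- ORDER BY sensor_id, time        -> oldest reading per sensor
    -- ORDER BY sensor_id, time DESC   -> newest reading per sensor (as in the sample output)
    SELECT DISTINCT ON (sensor_id) sensor_id, time, value
    FROM sensor_data
    WHERE sensor_id IN (1, 3)
    ORDER BY sensor_id, time;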

Efficiently minimizing a quasiconvex function

The problem is as follows: given black box access to a strictly quasiconvex (or unimodal) function $ f$ on an interval, say [0, 1], query $ f$ repeatedly at various points to find a subinterval containing the minimum. After $ n$ queries to $ f$ , if the interval’s length is reduced by a factor of $ t$ , the algorithm’s score is $ \sqrt[n]{t}$ , which is the amortized factor by which each query to $ f$ reduces the length of the interval.

For example, standard ternary search starts with $ f(0)$ and $ f(1)$ , additionally queries $ f(\frac{1}{3})$ and $ f(\frac{2}{3})$ , and obtains a subinterval with size $ \frac{2}{3}$ . It then iterates and each subsequent iteration starts with $ f$ evaluated at the endpoints of the subinterval and reduces the subinterval’s length by a factor of $ \frac{2}{3}$ . Thus the score is $ \sqrt{\frac{2}{3}}$ .

A better ternary search instead queries at $ \frac{1}{2}-\varepsilon$ and $ \frac{1}{2}+\varepsilon$ and iterates for a score of $ \sqrt{\frac{1}{2}+\varepsilon}$ .

An even better algorithm is “pentanary search”, starting with $ f(0)$ , $ f(\frac{1}{2})$ , and $ f(1)$ and querying $ f(\frac{1}{4})$ and $ f(\frac{3}{4})$ to obtain a subinterval that is the first, middle, or last half of $ [0, 1]$ . It then iterates. The key is that 3 points from each iteration can be used in the next. Thus the score is $ \sqrt{\frac{1}{2}}$ .

The best I know of, “golden search”, starts with $ f(0)$ , $ f(1)$ , and either $ f(1-\frac{1}{\phi})$ or $ f(\frac{1}{\phi})$ , querying for the other of those two, where $ \phi$ is the golden ratio. The subinterval is either $ [0, \frac{1}{\phi}]$ or $ [1-\frac{1}{\phi}, 1]$ and 3 values from the previous iteration can be reused, achieving a score of $ \frac{1}{\phi}$ .
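
For concreteness, here is a minimal Python sketch of that golden-ratio scheme (commonly called golden-section search). After the initial two interior queries, each further query reuses one previous interior point and multiplies the interval length by $ \frac{1}{\phi}$ :

    import math

    def golden_section_minimize(f, lo=0.0, hi=1.0, queries=30):
        """Shrink [lo, hi] around the minimum of a unimodal f.

        Each iteration reuses one of the two interior points, so beyond the first
        two evaluations every query shrinks the interval by a factor of 1/phi.
        """
        inv_phi = (math.sqrt(5) - 1) / 2              # 1/phi, about 0.618
        a = lo + (1 - inv_phi) * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        fa, fb = f(a), f(b)
        for _ in range(queries - 2):                  # two queries already spent on a and b
            if fa < fb:                               # minimum lies in [lo, b]
                hi, b, fb = b, a, fa
                a = lo + (1 - inv_phi) * (hi - lo)
                fa = f(a)
            else:                                     # minimum lies in [a, hi]
                lo, a, fa = a, b, fb
                b = lo + inv_phi * (hi - lo)
                fb = f(b)
        return lo, hi

    # The minimum of (x - 0.3)^2 is at 0.3; the returned interval brackets it.
    print(golden_section_minimize(lambda x: (x - 0.3) ** 2))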

Can we do better than “golden search”? Do we know the optimal score for this problem?