Covering lemmas in Hochman’s “On self-similar sets with overlaps and inverse theorems for entropy”

I am confused about the covering lemmas in the paper named in the title and really hope to get some ideas here.

First, Lemma 3.7 (Image of Lemma 3.7); for convenience, here is the lemma it relies on (Image of Lemma 3.6).

I do not understand the "$\supseteq$" in the line "$J'\cap(J'-\ell)\supseteq \cdots$". I am also wondering whether the following provides a counter-example to this line and to the lemma:

Let $n=12$, $I=\{0,1,2,3,4,5,6,7,8\}$, $m=3$. Then $I'=\{0,4,8\}$ by the construction in Lemma 3.6. Let $J=\{0,2,4,6,8,10,12\}$ and $\delta=0.5$. Then for all $i\in I$, $|[i,i+3]\cap J|=2\geq(1-\delta)m$. Take $\ell=1$. Then for every $J'\subseteq J$ we have $J'\cap(J'-\ell)\subseteq J\cap(J-\ell)=\emptyset$, while $\bigcup_{i\in I'} J\cap[i,i+2]=\{0,2,4,6,8,10\}$ and $(1-\delta-\frac{\ell}{m})|I|=(\frac{1}{2}-\frac{1}{3})\cdot 9>0$.
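For what it's worth, the arithmetic in this counter-example can be checked mechanically; here is a small Python script (my own verification, with the sets and constants copied from above):

    n, m, ell, delta = 12, 3, 1, 0.5
    I = set(range(9))
    Iprime = {0, 4, 8}
    J = {0, 2, 4, 6, 8, 10, 12}

    # Every window [i, i+3] meets J in at least (1 - delta) * m points:
    assert all(len(J & set(range(i, i + m + 1))) >= (1 - delta) * m for i in I)

    # J and J - ell are disjoint, hence J' ∩ (J' - ell) is empty for every J' ⊆ J:
    assert J & {j - ell for j in J} == set()

    # Union over I' of J ∩ [i, i+2]:
    print(sorted(set.union(*[J & set(range(i, i + m)) for i in Iprime])))
    # [0, 2, 4, 6, 8, 10]

    # Right-hand side of the conclusion:
    print((1 - delta - ell / m) * len(I))  # 1.5 > 0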

Second, Lemma 3.8 (Excerpt of Lemma 3.8): I am not sure how the marked inequality is obtained.

Since subsequent results depend on these lemmas and the paper is well established in the field, I expect there is a way to resolve my confusion.

Thanks for any help!

Is there a less than $O(n)$ algorithm for converting UTF-8 character offsets to byte offsets, in a gap buffer?

A gap buffer is a variation on a dynamically-sized array, but with a gap inside it. The gap makes editing operations around it more efficient; for example, deletion just before the gap can be implemented by simply making the gap larger.
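Here is a minimal sketch of that idea in Python (my illustration, not taken from any particular editor): the logical text is everything outside the gap, and deleting the byte just before the gap merely moves one index.

    class GapBuffer:
        # The text is the bytes outside [gap_start, gap_end).
        def __init__(self, data, gap_start, gap_end):
            self.data = bytearray(data)
            self.gap_start, self.gap_end = gap_start, gap_end

        def text(self):
            return bytes(self.data[:self.gap_start] + self.data[self.gap_end:])

        def delete_before_gap(self, nbytes):
            # O(1): widening the gap deletes without moving any bytes.
            # (For UTF-8 text, nbytes should cover whole characters.)
            self.gap_start -= nbytes

    buf = GapBuffer(b"hello__world", 5, 7)   # gap covers the two underscores
    buf.delete_before_gap(1)
    print(buf.text())                        # b'hellworld'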

UTF-8 is a variable-width encoding for text. The first bits of the first byte of a character describe how many bytes the character occupies, up to a maximum of four.
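In other words, the sequence length can be read off the leading byte alone. A small helper, in Python (mine; it does not validate continuation bytes):

    def utf8_len(first_byte):
        # Number of bytes in the UTF-8 sequence starting with this byte.
        if first_byte < 0b1000_0000:
            return 1                          # 0xxxxxxx: ASCII
        if first_byte < 0b1100_0000:
            raise ValueError("continuation byte, not a sequence start")
        if first_byte < 0b1110_0000:
            return 2                          # 110xxxxx
        if first_byte < 0b1111_0000:
            return 3                          # 1110xxxx
        return 4                              # 11110xxx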

A cursor position in a string (specifically, a position between characters) can be described as a pair: the line number (zero-based, say) and the horizontal character offset (how many characters the cursor is to the right of the position just before the first character on the line). This representation is convenient for positioning a cursor.

However, in order to move the byte offsets that determine where the buffer's gap is, we need to convert the line number and character offset into a byte index.

The best way I currently know how to do this is the following $O(n)$ algorithm:

Call the line number and character offset we are looking for the byte offset of, "the target". Keep track of the current line number and character offset as we go, starting each at 0.

    for (the characters before the gap) {
        if the current line number and character offset matches the target,
            return the byte offset.
    }

    if the space right after the last character before the gap matches the target,
        return the byte offset of the start of the gap.

    for (the characters after the gap) {
        if the current line number and character offset matches the target,
            return the byte offset.
    }

    if the space right after the last character after the gap matches the target,
        return the byte offset of the end of the buffer.

    Otherwise, the cursor is out of bounds.

This is all under the assumption that the buffer is well-formed. That is, the gap starts immediately after a UTF-8 character, and it ends just before another one or at the end of the entire buffer.
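For concreteness, here is that algorithm as a runnable Python sketch under my own simplifying convention (the names are mine): the two halves of the buffer are passed as byte strings, and the function returns a logical byte offset, which the caller maps to a physical one by adding the gap length whenever the offset lands past the first half (an offset of exactly len(before) is the start of the gap).

    def cursor_to_byte_offset(before, after, target_line, target_col):
        # `before` and `after` are the well-formed UTF-8 halves of the buffer.
        line, col, offset = 0, 0, 0
        for half in (before, after):
            for ch in half.decode("utf-8"):
                if (line, col) == (target_line, target_col):
                    return offset
                if ch == "\n":
                    line, col = line + 1, 0
                else:
                    col += 1
                offset += len(ch.encode("utf-8"))
        if (line, col) == (target_line, target_col):
            return offset                     # position just past the last character
        raise IndexError("cursor out of bounds")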

Is there a way to do this with a lower computational complexity than $O(n)$? When I try to imagine one, the closest I can get is something like binary search, but it seems that finding a pivot point (beyond perhaps the first one, which we could cache) would involve iterating over the buffer anyway, so it wouldn't actually be $O(\log n)$.

Can all $O(n)$ problems be solved without nested loops?

There are examples of algorithm implementations that contain nested loops but have complexity $O(n)$, and some of them have corresponding implementations that contain no nested loops. This raises a question: can all such implementations be simplified or converted to implementations with only top-level loops? That is, can every problem that has an $O(n)$ algorithm be solved by an algorithm without nested loops?
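As a concrete instance of the premise (my example, not from the question): the sliding-window scan below contains a nested loop yet runs in $O(n)$, because `left` only ever moves forward, so the inner loop's body executes at most $n$ times across the entire run of the outer loop.

    from collections import Counter

    def longest_at_most_k_distinct(s, k):
        # Length of the longest substring of s with at most k distinct characters.
        counts, left, best = Counter(), 0, 0
        for right, ch in enumerate(s):
            counts[ch] += 1
            while len(counts) > k:            # nested, but amortized O(1)
                counts[s[left]] -= 1
                if counts[s[left]] == 0:
                    del counts[s[left]]
                left += 1
            best = max(best, right - left + 1)
        return best

    print(longest_at_most_k_distinct("aabacbebebe", 3))  # 7 ("cbebebe")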

Does the fact that the amortized cost of each extract-max operation is $O(1)$ mean that the entire sequence can be processed in $O(n)$ time?

Consider an ordinary binary max-heap data structure with $n$ elements that supports insert and extract-max in $O(\log n)$ worst-case time.

Q) Consider a sequence of $n$ extract-max operations performed on a heap H that initially contains $n$ elements. Does the fact that the amortized cost of each extract-max operation is $O(1)$ mean that the entire sequence can be processed in $O(n)$ time? Justify your answer.

Ans: No. A sequence of $n$ EXTRACT-MAX operations will cost $O(n\log n)$. In a heap, the leaves all lie on the last one or two levels, and there are about $n/2 = O(n)$ of them. Each call to EXTRACT-MAX removes the maximum element from the root, moves a leaf into its place, and sifts it down to restore the heap, which takes $O(\log n)$ time since the leaves sit at depth about $\log n$. Doing this for all $O(n)$ extractions therefore gives a total time complexity of $O(n\log n)$.
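One way to sanity-check the "No" (my addition): $n$ back-to-back extract-max operations are exactly the extraction phase of heapsort, so processing the whole sequence in $o(n\log n)$ comparisons would beat the comparison-sorting lower bound. A minimal sketch in Python (heapq is a min-heap, so keys are negated):

    import heapq

    def extract_all_max(values):
        heap = [-v for v in values]
        heapq.heapify(heap)                   # O(n) build
        # n extract-max calls; together they sort, so this loop is Theta(n log n).
        return [-heapq.heappop(heap) for _ in range(len(heap))]

    assert extract_all_max([3, 1, 4, 1, 5]) == [5, 4, 3, 1, 1]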

does this make sense?

Show that the following algorithm takes $O(n)$ time

You are given a linked list of size $n$. An element can be accessed from the start of the list or from the end. The cost to access a location at index $i$ in a list of size $n$ is $\min(i, n-i)$. Once an index $i$ is accessed, the list is broken into two lists: one containing the first $i$ elements and the other containing the rest. I believe it has something to do with Cartesian trees, but I am not clear on how to proceed with that chain of thought.

Show that the total cost incurred to access all the elements in any arbitrary order is $O(n)$.
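To make the cost model concrete, here is a small simulation; the conventions are entirely my reading of the statement (0-based indices; a list of size $s$ charges $\min(i, s-i)$ for its element at local index $i$; the split puts the first $i$ elements in one list and the rest in the other):

    import bisect

    def total_access_cost(n, order):
        breaks = [0, n]                  # sorted boundaries of the current lists
        total = 0
        for p in order:                  # p indexes into the original list
            j = bisect.bisect_right(breaks, p)
            lo, hi = breaks[j - 1], breaks[j]
            total += min(p - lo, hi - p)
            if lo < p:                   # split [lo, hi) into [lo, p) and [p, hi)
                bisect.insort(breaks, p)
        return total

    print(total_access_cost(8, range(8)))  # 7: left-to-right costs at most 1 per access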

Efficiently shuffling items in $N$ buckets using $O(N)$ space

I’ve run into a challenging algorithm puzzle while trying to generate a large amount of test data. The problem is as follows:

  • We have $N$ buckets, $B_1$ through $B_N$. Each bucket $B_i$ maps to a unique item $a_i$ and a count $k_i$. Altogether, the collection holds $T=\sum_{i=1}^{N} k_i$ items. This is a more compact representation of a vector of $T$ items where each $a_i$ is repeated $k_i$ times.

  • We want to output a shuffled list of the $T$ items, with all permutations equally probable, using only $O(N)$ space and minimal time complexity. (Assume a perfect RNG.)

  • $N$ is fairly large and $T$ is much larger; 5,000 and 5,000,000 respectively in the problem that led me to this investigation.

Now clearly the time complexity is at least $\Omega(T)$, since we have to output that many items. But how closely can we approach that lower bound? Some algorithms:

  • Algorithm 1: Expand the buckets into a vector of $T$ items and use Fisher-Yates. This takes $O(T)$ time, but also $O(T)$ space, which we want to avoid.

  • Algorithm 2: For each step, choose a random number $R$ from $[0,T-1]$. Traverse the buckets, subtracting $k_i$ from $R$ each time, until $R<0$; then output $a_i$ and decrement $k_i$ and $T$. This seems correct and uses no extra space. However, it takes $O(NT)$ time, which is quite slow when $N$ is large.

  • Algorithm 3: Convert the vector of buckets into a balanced binary tree with the buckets at the leaves; the depth is close to $\log_2 N$. Each node stores the total count of all the buckets under it. To draw an item, choose a random number $R$ from $[0,T-1]$ and descend the tree accordingly, decrementing each node's count as we go; when descending to the right, reduce $R$ by the left child's count. When we reach a leaf, output its item. This uses $O(N)$ space and $O(T\log N)$ time. (A code sketch follows this list.)

  • Algorithm 3a: Same as Algorithm 3, but with a Huffman tree; this should be faster when the $k_i$ values vary widely, since the most frequently visited nodes sit closer to the root. The performance is harder to assess, but it looks like it would range from $O(T)$ to $O(T\log N)$ depending on the distribution of the $k_i$.
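Here is a sketch of Algorithm 3 in Python; it is my own rendering (an implicit array-based segment tree rather than explicit nodes), and the generator streams the shuffled items so that only the $O(N)$ tree is held in memory:

    import random

    def bucket_shuffle(items, counts):
        # Uniformly shuffle the T items encoded by (items, counts):
        # O(N) space, O(log N) per output item, O(T log N) overall.
        n = len(counts)
        size = 1
        while size < n:
            size *= 2
        # tree[size + i] holds bucket i's count; each internal node
        # holds the sum of its two children, so tree[1] == T.
        tree = [0] * (2 * size)
        tree[size:size + n] = counts
        for v in range(size - 1, 0, -1):
            tree[v] = tree[2 * v] + tree[2 * v + 1]
        while tree[1] > 0:
            r = random.randrange(tree[1])   # pick the r-th remaining item
            v = 1
            while v < size:                 # descend: go right when r falls
                v *= 2                      # beyond the left subtree's count
                if r >= tree[v]:
                    r -= tree[v]
                    v += 1
            u = v                           # v - size is the chosen bucket
            while u >= 1:                   # decrement counts along the path
                tree[u] -= 1
                u //= 2
            yield items[v - size]

    print(list(bucket_shuffle(["a", "b", "c"], [2, 1, 3])))
    # e.g. ['c', 'a', 'c', 'b', 'a', 'c']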

Algorithm 3 is the best I’ve come up with. Here are some illustrations to clarify it:

(Illustrations of Algorithm 3)

Does anyone know of a more efficient algorithm? I tried searching with various terms but could not find any discussion of this particular task.