Postgres REINDEX time estimate

I’ve got an older DB (Postgres 10.15) that hasn’t yet been upgraded. One problematic table had a few large indexes on it, some of which were corrupt and needed reindexing. Since it’s not on version 12+, I can’t use REINDEX CONCURRENTLY, so the reindex has to run non-concurrently and will hold a write lock on the table. I’d like to do some rough calculations on how long the reindex will take so I can plan a maintenance window. Most of my research ends up at “just use pg_stat_progress_create_index!” (which isn’t available in 10), or people simply saying to use CONCURRENTLY.

The table is ~200GB, and it has 7 indexes of ~14GB each (as per pg_relation_size). I can get a sustained ~900MB/s read rate on the DB for this task. Is there a simple metric I can use to determine how much data needs to be read to reindex fully?
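In case it helps frame the question, here is the kind of back-of-envelope arithmetic I have in mind, as a minimal Python sketch. It assumes each index rebuild scans the whole heap once, the seven rebuilds run sequentially, and writes move at roughly the same throughput as reads (all assumptions on my part, not measured behaviour):

```python
# Back-of-envelope estimate for non-concurrent REINDEX time.
# Assumptions: each index rebuild does one full heap scan, the 7 rebuilds run
# one after another, and index writes go out at roughly the read throughput.
# Sort/CPU time and maintenance_work_mem spills are ignored, so treat this as
# a floor rather than a prediction.

table_gb = 200          # pg_relation_size of the heap
index_gb = 14           # pg_relation_size of each index
n_indexes = 7
rate_gb_per_s = 0.9     # ~900MB/s sustained

read_gb = n_indexes * table_gb      # one heap scan per rebuilt index
write_gb = n_indexes * index_gb     # freshly written index files

seconds = (read_gb + write_gb) / rate_gb_per_s
print(f"~{read_gb + write_gb:.0f} GB of I/O, roughly {seconds / 60:.0f} minutes")
```

Whether the “one heap scan per index” assumption actually holds for a non-concurrent REINDEX is exactly the part I’m unsure about, hence the question.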

How do you estimate the complexity of a sequential algorithm, given that we know the complexity of each step?

First case: I stumbled upon a two-step sequential algorithm where the big-O complexity of each step is O(N^9).

Second case: the algorithm has three steps, where the complexity of step 1 is O(N^2), the complexity of step 2 is O(N^3), and the complexity of step 3 is O(N^9).

What would be the overall complexity in the first case and in the second case?
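My current reasoning, which I’d like to confirm, is that for sequentially composed steps the costs simply add, so the largest term dominates: $T(N) = T_1(N) + T_2(N) + \dots + T_k(N)$. Worked through for the two cases, that would give $O(N^9) + O(N^9) = O(2 \cdot N^9) = O(N^9)$ for the first case, and $O(N^2) + O(N^3) + O(N^9) = O(N^9)$ for the second.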

How to communicate a SWAG estimate

I support a large enterprise software project which frequently receives enhancement requests from our customer. The customer will only pay for the work up front in fixed-price contracts. We provide a SWAG estimate first, then provide a detailed estimate after they green-light the SWAG. The detailed estimates are time-consuming, and we are only compensated for the estimation time when they sign off on the enhancement, so the SWAG estimate provides a level of protection for us.

We communicate our SWAG estimate as Small, Medium, Large, or Very Large, and we have communicated ranges associated with each of these values. For example,

  • Small: < 5 days
  • Medium: 5 - 15 days
  • Large: 15 - 50 days
  • Very Large: > 50 days

Having put this into practice for a couple of years, I have some concerns:

  1. Sometimes the SWAG estimate is expected to land at the high end of a range. This form of estimate can make it difficult to manage customer expectations: a 15-day effort is very different from a 50-day effort, and the customer can green-light a SWAG under the optimistic assumption.
  2. If the customer approves a SWAG estimate, we can feel obliged to cap our detailed estimate at the high end of the SWAG range. If we move up a range, there are usually additional billing discussions, which are painful, slow, and offer no guarantee of compensation.

Are there any standard practices or tried-and-true methods for communicating SWAGs that can help us better manage customer expectations?

How do you estimate the code size of a C library?

When trying to select an open source library based on code size constraints, is there a simple way to estimate what the compiled size of a C library would be?

Integrating the library into the project would be one way to determine it, but I’m hoping there is some existing way to estimate it independently of any particular project.

I’m looking to estimate the size of “MQTTPacket” from the following library: https://github.com/eclipse/paho.mqtt.embedded-c
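One rough approach I’m considering is to compile just the translation units that MQTTPacket ships and sum the section sizes reported by the toolchain’s size utility. A minimal Python sketch of that idea is below; the compiler, flags, and source path are all assumptions on my part (adjust them to your target toolchain and your checkout of the repo):

```python
# Rough sketch: compile each MQTTPacket translation unit with the flags the
# project would use, then sum the text+data section sizes reported by `size`.
# Assumes a GCC-style toolchain on PATH and a local clone of
# https://github.com/eclipse/paho.mqtt.embedded-c; the source path below is a
# guess at the repo layout, so adjust it to your checkout.
import glob
import subprocess
import tempfile

CC = "gcc"                    # e.g. arm-none-eabi-gcc for an embedded target
CFLAGS = ["-Os", "-c"]        # optimize for size, compile only
SRC_GLOB = "paho.mqtt.embedded-c/MQTTPacket/src/*.c"   # assumed layout

total_text = total_data = 0
with tempfile.TemporaryDirectory() as tmp:
    for src in sorted(glob.glob(SRC_GLOB)):
        obj = f"{tmp}/{src.split('/')[-1]}.o"
        subprocess.run([CC, *CFLAGS, src, "-o", obj], check=True)
        # `size` output: a header line, then "text data bss dec hex filename"
        line = subprocess.run(["size", obj], capture_output=True,
                              text=True, check=True).stdout.splitlines()[1]
        text, data, *_ = line.split()
        total_text += int(text)
        total_data += int(data)

print(f"text: {total_text} bytes, data (initialised): {total_data} bytes")
```

If the project links with -ffunction-sections and -Wl,--gc-sections, the final binary only pulls in the packet functions it actually calls, so the summed object sizes are closer to an upper bound than a prediction.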

How do you empirically estimate the most popular seat and get an upper bound on total variation?

Say there are $n$ seats $\{s_1, \dots, s_n\}$ in a theater and the theater wants to know which seat is the most popular. They allow $1$ person in for $m$ nights in a row. For all $m$ nights, they record which seat is occupied.

They are able to calculate probabilities for whether or not a seat will be occupied using empirical estimation: $P(s_i \text{ is occupied}) = \frac{\#\text{ of times } s_i \text{ is occupied}}{m}$. With this, we have an empirical distribution $\hat{\mathcal{D}}$ which maximizes the likelihood of our observed data drawn from the true distribution $\mathcal{D}$. This much I understand! But I’m totally lost trying to make this more rigorous.

  • What is the upper bound on $\mathbb{E}[d_{TV}(\hat{\mathcal{D}}, \mathcal{D})]$? Why? Note: $d_{TV}(\mathcal{P}, \mathcal{Q})$ is the total variation distance between distributions $\mathcal{P}$ and $\mathcal{Q}$.
  • What does $m$ need to be such that $\hat{\mathcal{D}}$ is accurate to some $\epsilon$? Why?
  • How does this generalize if the theater allows $k$ people in each night (instead of $1$ person)?
  • Is empirical estimation the best approach? If not, what is?

If this is too much to ask in one question, let me know. A reference to a textbook that would help answer these questions would also be happily accepted.
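In case it’s useful for anyone answering, here is a small Monte Carlo sketch of the setup as I understand it: it draws $m$ nights of seat choices from a made-up true distribution, builds the empirical distribution, and measures $d_{TV}(\hat{\mathcal{D}}, \mathcal{D}) = \frac{1}{2}\sum_i |\hat{p}_i - p_i|$. The distribution and the values of $n$ and $m$ are arbitrary; it only shows the error shrinking as $m$ grows, which is the behaviour I’d like to see bounded rigorously.

```python
# Monte Carlo sketch (illustrative only): draw m nights of seat choices from a
# made-up "true" distribution over n seats, build the empirical distribution,
# and measure the total variation distance d_TV = 0.5 * sum_i |p_hat_i - p_i|.
import random
from collections import Counter

def tv_distance(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

n = 20
# An arbitrary "true" distribution: seat i has weight proportional to i + 1.
weights = [i + 1 for i in range(n)]
total = sum(weights)
p_true = [w / total for w in weights]

random.seed(0)
for m in (100, 1_000, 10_000, 100_000):
    seats = random.choices(range(n), weights=p_true, k=m)   # m observed nights
    counts = Counter(seats)
    p_hat = [counts[i] / m for i in range(n)]                # empirical distribution
    best = max(range(n), key=lambda i: p_hat[i])             # estimated most popular seat
    print(f"m={m:>7}  d_TV={tv_distance(p_true, p_hat):.4f}  "
          f"most popular seat estimate: s_{best + 1}")
```

It obviously doesn’t answer the bound questions, but it makes it easy to sanity-check whatever bound an answer proposes against simulated data.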