Practical implementation of a slow lossless compression

I can losslessly compress any file (text, image, random data, video, audio) of size 120 kB down to 80 kB, which is more than a 30% reduction. So far the program is really slow: it takes 10 minutes to process a file of that size, though it could be made faster, hopefully down to microseconds if not nanoseconds.

This means that, using my program, we can losslessly compress any 1 GB file to about 666 MB:

  1. Divide a 1 GB file into 120 kB chunks.
  2. That gives about 8,333 files.
  3. Compress each file using my program.
  4. Each file would then be at most 80 kB.
  5. 80 kB × 8,333 files ≈ 666.7 MB.

I didn’t actually test a 1 GB file, because it would be impractical (10 minutes × 8,333 files ≈ 1,389 hours, i.e. roughly 58 days).
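As a quick sanity check on the arithmetic above (a sketch only; the chunk size, output size, and timing are the claimed figures, and 1 GB is taken as 1,000,000 kB for simplicity):

#include <iostream>

int main() {
    const double input_kb  = 1000000.0;  // 1 GB, taken as 1,000,000 kB
    const double chunk_kb  = 120.0;      // claimed input chunk size
    const double output_kb = 80.0;       // claimed compressed chunk size
    const double minutes_per_chunk = 10.0;

    const double chunks   = input_kb / chunk_kb;                       // ~8,333 chunks
    const double total_mb = chunks * output_kb / 1000.0;               // ~666.7 MB
    const double days     = chunks * minutes_per_chunk / 60.0 / 24.0;  // ~58 days

    std::cout << chunks << " chunks, " << total_mb << " MB total, "
              << days << " days of processing\n";
    return 0;
}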

  1. Would it really be practical in the real world if it’s improved?

  2. Is there any method, program, or algorithm that can do the same? I ask because I have read about information theory, which says it’s impossible to losslessly compress arbitrary files (I guess my program wasn’t ready when they published the article).

Choosing a compression method and settings for MP4 files

I’ve got a hard drive filled up with videos, and want them to take up as little space as possible.


Video encoding:

  • Around 20,000 kbps data rate and bitrate
  • 30 frames/second
  • H.264 AVC

Audio encoding:

  • AAC
  • 96 kbps
  • Stereo channels
  • 48 kHz sample rate


They’re separated into multiple folders, and here’s an example folder:

  • 29 files
  • Total duration of around 6 hours
  • Total size of around 25 GB
  • File size tends to vary between 650 MB and 1.25 GB


What I want to know is how to choose settings like dictionary size, word size, solid block size, etc. I’m assuming that 7z with LZMA2 is best for the archive format and compression method.
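For orientation, here is an illustrative 7-Zip command line showing where those settings live; the specific values are placeholders, not recommendations:

7z a -t7z -m0=lzma2 -mx=9 -md=256m -mfb=273 -ms=4g -mmt=on archive.7z folder/

Here -md sets the dictionary size, -mfb the number of fast bytes (shown as “word size” in the GUI), -ms the solid block size, and -mmt multithreading.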

lbzip2: compressing files with spaces

I can’t compress files with these two commands when the file names contain spaces.

cp --parents $path /home/fastdownloads/${ThisService_ServiceId}/csgo/
lbzip2 /home/fastdownloads/${ThisService_ServiceId}/csgo/$path

Error output:

[+] Compressing CSO/SF.bz2
cp: cannot stat ‘Ethereal’: No such file or directory
lbzip2: skipping "/home/fastdownloads/279/csgo/Ethereal": lstat(): No such file or directory

The file names contain spaces, and the commands won’t compress them.
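For reference, a minimal sketch of the same two steps with the expansions double-quoted, which is what unquoted variables containing spaces normally need; the paths and variable names are taken from the commands above:

cp --parents "$path" "/home/fastdownloads/${ThisService_ServiceId}/csgo/"
lbzip2 "/home/fastdownloads/${ThisService_ServiceId}/csgo/$path"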

How to find optimal token set for compression?

By token I mean the smallest element of source data that the compression algorithm works on. It may be a bit (as in DMC), a letter (as in Huffman or PPM), a word, or a variable-length string (as in LZ).

(Please feel free to correct me on this term, if I use it incorrectly.)

I’m thinking: what if we first find an optimal set of tokens, and only then compress our data as a stream of these tokens? Would it improve compression of text, of XML, or of some other kind of data? LZ sort of does this, but it is limited to letters. I’m asking about a truly general approach in which a token can have any bit length. For example, it could be interesting for compressing x86 executable files, or in other cases where sub-byte (and super-byte) bit strings behave as individual tokens.

How can this set of tokens be found optimally, in a way that maximizes the average entropy of a stream of such tokens under a 0-order Markov model?

Are there existing algorithms for it? Are there any algorithms related to the problem?

How can an approximation of this set be found? What is the complexity of the optimal and the approximate solutions?

And what if we want to maximize the entropy under a higher-order Markov model? How much harder does the problem become in that case?
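One well-known heuristic that is at least related to this problem is byte-pair encoding (BPE), which greedily merges the most frequent adjacent pair of tokens into a new token; it does not optimize the entropy objective asked about, but it gives a feel for the search space. A minimal sketch (toy input, single-byte starting tokens, and a fixed merge budget, all illustrative):

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::string input = "abcabcabdabd";          // toy data
    std::vector<int> tokens(input.begin(), input.end());
    int next_id = 256;                           // ids 0..255 are raw bytes
    std::map<int, std::pair<int, int>> merges;   // learned token set: new id -> merged pair

    for (int step = 0; step < 4; ++step) {       // fixed number of merges
        // count all adjacent token pairs
        std::map<std::pair<int, int>, int> counts;
        for (size_t i = 0; i + 1 < tokens.size(); ++i)
            ++counts[{tokens[i], tokens[i + 1]}];
        if (counts.empty()) break;

        // pick the most frequent pair
        auto best = counts.begin();
        for (auto it = counts.begin(); it != counts.end(); ++it)
            if (it->second > best->second) best = it;
        if (best->second < 2) break;             // no pair worth merging

        // replace every occurrence of that pair with a fresh token id
        merges[next_id] = best->first;
        std::vector<int> merged;
        for (size_t i = 0; i < tokens.size(); ++i) {
            if (i + 1 < tokens.size() &&
                std::make_pair(tokens[i], tokens[i + 1]) == best->first) {
                merged.push_back(next_id);
                ++i;                             // skip the second element of the pair
            } else {
                merged.push_back(tokens[i]);
            }
        }
        tokens.swap(merged);
        ++next_id;
    }

    std::cout << merges.size() << " merged tokens learned, "
              << tokens.size() << " tokens in the recoded stream\n";
    return 0;
}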

First Huffman Compression Algorithm in C++

I decided to take up learning C++ recently, so I coded a Huffman compression algorithm. I’m particularly interested in a critique of my C++ techniques (pointers, references, best practices, etc.), but I am open to any comments you have.

Thanks for your time!

#include <map>
#include <iostream>
#include <vector>
#include <deque>
#include <string>
#include <algorithm>

typedef std::pair<char, int> weight_pair;

struct node {
    int weight;
    node *parent = NULL;
    node *children[2] = {NULL, NULL};
    std::string content;
};

bool compareNodeWeights(const node *node1, const node *node2) {
    return (node1 -> weight) < (node2 -> weight);
}

//builds a binary tree from an alphabet and an associated set of weights
node *build_tree(const std::map<char, int> &weights){
    std::deque<node*> nodes;

    for (auto const &x : weights) {
        node *leaf_node = new node;

        leaf_node -> weight = x.second;
        leaf_node -> content = std::string(1, x.first);

        nodes.push_back(leaf_node);
    }

    while (nodes.size() > 1) {
        std::sort(nodes.begin(), nodes.end(), compareNodeWeights);

        node* new_node = new node;

        new_node -> weight = nodes[0] -> weight + nodes[1] -> weight;
        new_node -> content = nodes[0] -> content + nodes[1] -> content;
        nodes[0] -> parent = new_node;
        nodes[1] -> parent = new_node;
        new_node -> children[0] = nodes[0];
        new_node -> children[1] = nodes[1];

        nodes.erase(nodes.begin());
        nodes.erase(nodes.begin());

        nodes.push_back(new_node);
    }

    return nodes[0];
}

//traverses the tree to find the prefix code for a given leaf node in a tree
std::deque<bool> find_prefix_code(const node *leaf) {
    std::deque<bool> result;
    const node *curr_node = leaf;

    while (curr_node -> parent != NULL) {
        if (curr_node -> parent -> children[0] -> content == curr_node -> content) {
            result.push_front(0);
        }
        else {
            result.push_front(1);
        }

        curr_node = curr_node -> parent;
    }

    return result;
}

std::vector<bool> compress(const std::string &message, const node *tree) {
    std::vector<bool> result;

    std::vector<const node*> open;
    open.push_back(tree);

    std::map<char, std::deque<bool>> prefix_codes;

    while (open.size() > 0) {
        const node* curr_node = open[0];

        if (curr_node -> content.size() == 1){
            prefix_codes.insert( std::pair<char, std::deque<bool>>(curr_node -> content[0], find_prefix_code(curr_node)) );
        }
        else {
            open.push_back(curr_node -> children[0]);
            open.push_back(curr_node -> children[1]);
        }

        open.erase(open.begin());
    }

    for (const char c : message) {
        for (const bool &b : prefix_codes[c]) {
            result.push_back(b);
        }
    }

    return result;
}

std::string decompress(const std::vector<bool> &data, const node *tree) {
    const node *curr_node = tree;
    std::string result = "";

    for (const bool b : data) {
        int direction = b;

        curr_node = curr_node -> children[direction];

        if (curr_node -> content.size() == 1) {
            result += curr_node -> content;
            curr_node = tree;
        }
    }

    return result;
}

void print_compressed(const std::vector<bool> &data) {
    std::cout << "Compressed data: ";

    for (const bool b : data) {
        std::cout << b;
    }

    std::cout << std::endl;
}

void delete_tree(node *tree) {
    for (int i = 0; i <= 1; i++) {
        if (tree -> children[i] != NULL) {
            delete_tree(tree -> children[i]);
        }
    }

    delete tree;
}

int main() {
    std::map<char, int> weights;

    weights.insert(weight_pair(' ', 3));
    weights.insert(weight_pair('a', 3));
    weights.insert(weight_pair('d', 3));
    weights.insert(weight_pair('b', 1));
    weights.insert(weight_pair('c', 1));

    node *tree = build_tree(weights);

    std::vector<bool> compressed_message = compress("a cab dab bad", tree);
    print_compressed(compressed_message);

    std::string message = decompress(compressed_message, tree);
    std::cout << "Decompressed data: " << message << std::endl;

    delete_tree(tree);

    return 0;
}

Graph Node Compression Algorithm?

I have a collection of graph nodes which are connected to each other.

I wrote a DFS algorithm to recursively search the graph and find all connected nodes; however, due to the size of the network, it is taking too long.

To improve performance, I wanted to tag the nodes as MAJOR and NOT-MAJOR, so that I can create a new graph that only contains the connections from MAJOR nodes to other MAJOR nodes, as illustrated in the image below, where the big circles denote MAJOR nodes and the red dots denote NOT-MAJOR nodes:

[Figure: on the left, the original graph with large circles (MAJOR) and red dots (NOT-MAJOR); on the right, the compressed graph containing only MAJOR nodes.]

The desired result would be taking the left graph and compressing it down to the right.

The data I am working through is very large, and there are a lot of interconnections, many islands, and always at least 1 major node.

I have written a second version of my DFS function that takes the graph and a MAJOR node, traverses each node connected to that MAJOR node, and checks whether the connected node is itself a MAJOR node; if so, it appends that MAJOR node to a new list and returns the list as the new connections, otherwise it returns a list containing only the MAJOR node itself.

But this is still taking a very long time to pass through the entire network.
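For concreteness, here is a minimal sketch of the traversal described above, under assumed representations (an adjacency list keyed by integer node ids and a set of MAJOR ids; all names are illustrative): a BFS from one MAJOR node that stops expanding whenever it reaches another MAJOR node. Running it once per MAJOR node would yield the edges of the compressed graph.

#include <queue>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Graph = std::unordered_map<int, std::vector<int>>;

// Returns the MAJOR nodes reachable from `start` through NOT-MAJOR nodes only.
std::vector<int> major_neighbours(const Graph &g,
                                  const std::unordered_set<int> &major,
                                  int start) {
    std::vector<int> result;
    std::unordered_set<int> visited{start};
    std::queue<int> frontier;
    frontier.push(start);

    while (!frontier.empty()) {
        int u = frontier.front();
        frontier.pop();
        auto it = g.find(u);
        if (it == g.end()) continue;
        for (int v : it->second) {
            if (!visited.insert(v).second) continue;  // already seen
            if (major.count(v)) {
                result.push_back(v);   // stop here: this becomes a MAJOR-to-MAJOR edge
            } else {
                frontier.push(v);      // keep walking through NOT-MAJOR nodes
            }
        }
    }
    return result;
}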

Is there a better way I can approach this compression problem?

Run Length Encoding Compression With Nested Level

I am studying some compression methods, and I’m very interested in RLE algorithms. Are there any nested-level RLE methods, such as:

ABABABCABABABC -> 3(AB)C3(AB)C -> 2(3(AB)C)?

It’s easy to decompress this using techniques like a stack, but what is the most efficient way to compress it? Are there any formal algorithms, and how would one evaluate them?
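To make the example concrete, here is a minimal sketch of the stack-style decoding mentioned above, assuming single-digit repeat counts and the digit(...) grouping used in the example; the exact grammar is my assumption:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

std::string decode(const std::string &s) {
    std::vector<std::string> texts{""};   // one partial result per nesting level
    std::vector<int> repeats;             // pending repeat count per open group

    for (char c : s) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            repeats.push_back(c - '0');   // repeat count for the next group
        } else if (c == '(') {
            texts.push_back("");          // start a new nested group
        } else if (c == ')') {
            std::string group = texts.back();
            texts.pop_back();
            int n = repeats.back();
            repeats.pop_back();
            for (int k = 0; k < n; ++k) texts.back() += group;
        } else {
            texts.back() += c;            // literal symbol
        }
    }
    return texts.back();
}

int main() {
    std::cout << decode("2(3(AB)C)") << "\n";   // prints ABABABCABABABC
}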

Is entropy a good indicator of the quality of a lossy compression?

Say I want to quantitatively evaluate the effectiveness of several color-to-grayscale conversion algorithms, which can be considered a form of lossy compression. Would entropy be a good indicator?

To calculate the entropy of a file, I first convert it to a byte stream with hexdump(1) or some other method (see below). Next, this stream of bytes is treated as a Markov chain, and its transition matrix is estimated by examining consecutive bytes, from which the entropy is calculated. The stationary distribution can be estimated by counting all bytes in the file if needed.
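For concreteness, here is a minimal sketch of that estimate; the file name is a placeholder, bytes are the Markov states, the transition matrix comes from consecutive-byte counts, and the stationary distribution is approximated by the byte frequencies, as described above:

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::FILE *f = std::fopen("image.raw", "rb");   // hypothetical input file
    if (!f) return 1;

    std::vector<std::vector<double>> pair_counts(256, std::vector<double>(256, 0.0));
    std::vector<double> counts(256, 0.0);

    int prev = std::fgetc(f);
    for (int c; (c = std::fgetc(f)) != EOF; prev = c) {
        pair_counts[prev][c] += 1.0;   // transition prev -> c
        counts[prev] += 1.0;
    }
    std::fclose(f);

    double total = 0.0;
    for (double n : counts) total += n;
    if (total == 0.0) return 1;        // file too short to estimate anything

    // Entropy rate: H = -sum_i pi_i * sum_j P(j|i) * log2 P(j|i),
    // with pi_i from byte frequencies and P(j|i) from transition counts.
    double H = 0.0;
    for (int i = 0; i < 256; ++i) {
        if (counts[i] == 0.0) continue;
        const double pi = counts[i] / total;
        for (int j = 0; j < 256; ++j) {
            if (pair_counts[i][j] == 0.0) continue;
            const double p = pair_counts[i][j] / counts[i];
            H -= pi * p * std::log2(p);
        }
    }
    std::printf("estimated entropy: %.3f bits per byte\n", H);
    return 0;
}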

There are three caveats, however:

  1. Sometimes you need to use longer “byte”s. For example, if the file is encoded with UTF-16, then an atomic unit should contain exactly 16 bits. Fortunately, almost all pictures use true color (24-bit RGB) or 8-bit grayscale respectively, so this won’t be a big problem.
  2. Pictures are 2D structures, and hexdump basically coerces them into 1D linear structures, which is inappropriate. The file should be traversed in another way so that the positional change is as smooth as possible, for example:
     * - *   * - *
       /   /   /
     *   *   *   *
     | /   /   / |
     *   *   *   *
       /   /   /
     * - *   * - *
  3. The inputs and outputs of the algorithms being compared against each other must be the same.

Any thoughts are appreciated!