Possible new test: compressibility #270

Open · o0101 opened this issue Jun 28, 2023 · 12 comments

@o0101 (Contributor) commented Jun 28, 2023

You might not want to do this, @rurban, or the smhasher community might not, but I think one nice idea when developing random/pseudo-random functions is to test how compressible their output is.

So loop the function on its own output to generate, say, 1 megabyte of binary output, and then run that through gzip or whatever. If it can't compress at all, that's a good sign. If it can compress, that shows there may be statistical regularities in the system that are perhaps not detected by the other tests.
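Something like this, where hashstream is just a placeholder name for a small generator that feeds the hash's output back into itself (I post a concrete one further down):

$ ./hashstream --infinite < seed.bin | head -c 1048576 > stream.bin
$ gzip -9 -v stream.bin

Close to 0% space saved is what you want to see; any real savings means gzip found structure that the statistical tests didn't.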

Thanks for reading this! I know you're busy, no worries if you can't get to it!

@o0101 (Contributor, Author) commented Jun 28, 2023

One advantage of these tests is that they are pretty fast. It may be tricky to integrate, though; I'm not sure how simple the various libraries are.

@o0101 (Contributor, Author) commented Jun 28, 2023

For example, discohash, which passes all the quality tests here, produces an output stream that is ~20% compressible with gzip -9 when run through the test above. This indicates that the existing tests miss some statistical regularity that is nevertheless picked up quickly by the compressibility test.
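For concreteness, the ~20% figure is the fraction of bytes gzip saves; one way to compute it from a captured chunk of the stream (stream.bin is a placeholder name):

$ orig=$(wc -c < stream.bin)
$ comp=$(gzip -9 -c stream.bin | wc -c)
$ echo "scale=3; 1 - $comp / $orig" | bc

A result near 0 means incompressible; discohash's stream came out around 0.2.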

@rurban (Owner) commented Jun 28, 2023

That is indeed a very good idea; I had the same one myself some years ago. It should detect hidden patterns. But gzip is not really good enough; a neural-net compressor would be better. I forget what I tried back then; it was the state-of-the-art compressor at the time, with a JIT-compiled NN.

@avaneev (Contributor) commented Jun 28, 2023

LZMA should be good; it's a freely available open-source library and can be integrated without much hassle. LZMA is also known as one of the best available compression algorithms, and it's very hard to do much better, NN or not.
https://7-zip.org/sdk.html
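For a sense of scale, an integration could be as small as one call to the SDK's one-shot LzmaCompress() helper from LzmaLib.h. A minimal sketch, assuming the SDK's documented defaults for the encoder parameters and its suggested output-buffer sizing:

#include <cstdint>
#include <vector>
#include "LzmaLib.h" // one-shot API from the 7-zip LZMA SDK

// Compressed-to-original size ratio; values near 1.0 mean the stream
// looks incompressible, which is what a good hash should produce.
double lzmaRatio(const std::vector<uint8_t>& data) {
  size_t destLen = data.size() + data.size() / 3 + 128; // SDK's sizing hint
  std::vector<unsigned char> dest(destLen);
  unsigned char props[LZMA_PROPS_SIZE];
  size_t propsSize = LZMA_PROPS_SIZE;
  int rc = LzmaCompress(dest.data(), &destLen, data.data(), data.size(),
                        props, &propsSize,
                        9,           // level: maximum
                        1 << 24,     // dictSize: 16 MiB
                        3, 0, 2, 32, // lc, lp, pb, fb: SDK defaults
                        1);          // numThreads
  if (rc != SZ_OK) return -1.0;
  return static_cast<double>(destLen) / data.size();
}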

@o0101 (Contributor, Author) commented Jun 28, 2023

Good idea!

@o0101 (Contributor, Author) commented Jun 30, 2023

For reference, here's the code I used to create the infinite stream:

#include <iostream>
#include <vector>
#include <fstream>
#include <cinttypes>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <cstring>

#include "discohash.h"

#ifdef _WIN32
#include <io.h>
#include <fcntl.h>
#define SET_BINARY_MODE(handle) _setmode(_fileno(handle), _O_BINARY)
#else
#define SET_BINARY_MODE(handle) ((void)0)
#endif

void readFileToBuffer(const std::string& filename, std::vector<uint8_t>& buffer) {
  std::ifstream file(filename, std::ios::binary | std::ios::ate);
  if (file.fail()) {
    throw std::runtime_error("Unable to open the file: " + filename);
  }
  std::streamsize size = file.tellg();
  file.seekg(0, std::ios::beg);

  buffer.resize(size, 0);

  if (!file.read(reinterpret_cast<char*>(buffer.data()), size)) {
    throw std::runtime_error("Failed to read the file: " + filename);
  }
}

void readStdinToBuffer(std::vector<uint8_t>& buffer) {
  // Slurp all of stdin into the buffer
  std::istreambuf_iterator<char> begin(std::cin), end;
  std::vector<char> inputChars(begin, end);
  buffer.assign(inputChars.begin(), inputChars.end());
}

int main(int argc, char **argv) {
  std::vector<uint8_t> buffer;
  std::string filename;
  std::string outputFilename;
  bool infiniteMode = false;
  FILE* outputFile = stdout; // Default to stdout
  int outputWords = 4;

  // Handle flags and arguments
  for (int i = 1; i < argc; i++) {
    if (strcmp(argv[i], "--infinite") == 0) {
      infiniteMode = true;
    } else if (strcmp(argv[i], "--outfile") == 0) {
      if (i + 1 < argc) {
        outputFilename = argv[++i];
        outputFile = fopen(outputFilename.c_str(), "wb");
        if (!outputFile) {
          std::cerr << "Error: Unable to open output file: " << outputFilename << std::endl;
          return EXIT_FAILURE;
        }
      } else {
        std::cerr << "Error: --outfile option requires a filename argument." << std::endl;
        return EXIT_FAILURE;
      }
    } else if (strcmp(argv[i], "--words") == 0) {
      if (i + 1 < argc) {
        outputWords = std::stoi(argv[++i]);
        if (outputWords < 1 || outputWords > 4) {
          std::cerr << "Error: --words option requires an integer between 1 and 4." << std::endl;
          return EXIT_FAILURE;
        }
      } else {
        std::cerr << "Error: --words option requires an integer argument." << std::endl;
        return EXIT_FAILURE;
      }
    } else {
      filename = argv[i];
    }
  }

  if (infiniteMode && outputFile == stdout) {
    SET_BINARY_MODE(stdout);
  }

  bool readFromFile = !filename.empty() && filename != "-";
  if (readFromFile) {
    readFileToBuffer(filename, buffer);
  } else {
    readStdinToBuffer(buffer);
  }

  // Buffer to store the hash output
  std::vector<uint64_t> hash(4);

  // Sponge-like construction: hash the seed, emit the 32-byte digest,
  // then feed the digest back in as the next message
  if (infiniteMode) {
    std::vector<uint8_t> input = buffer;
    while (true) {
      BEBB4185_64(input.data(), input.size(), 0, hash.data());
      std::fwrite(hash.data(), sizeof(uint64_t), 4, outputFile);
      std::fflush(outputFile); // make sure it's written
      // Feed the digest back in as the next message; resize first so a
      // seed shorter than 32 bytes can't be overflowed by the memcpy
      input.resize(sizeof(uint64_t) * 4);
      std::memcpy(input.data(), hash.data(), sizeof(uint64_t) * 4);
    }
  } else {
    BEBB4185_64(buffer.data(), buffer.size(), 0, hash.data());
    for (int i = 0; i < outputWords; ++i) {
      fprintf(outputFile, "%016" PRIx64, hash[i]);
    }
    fprintf(outputFile, " %s\n", readFromFile ? filename.c_str() : "-");
  }

  // Close the output file if it's not stdout
  if (outputFile != stdout) {
    fclose(outputFile);
  }

  return EXIT_SUCCESS;
}
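To build it and capture a chunk of the stream, something along these lines (the source file names here are placeholders; the exact discohash file name in this repo may differ):

$ c++ -std=c++11 -O2 -o hashstream stream.cpp discohash.cpp
$ ./hashstream --infinite < seed.bin | head -c 1048576 > stream.bin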

@o0101 (Contributor, Author) commented Jun 30, 2023

For illustration of exactly what I meant.

@o0101 (Contributor, Author) commented Jun 30, 2023

Then I'd run:

$ gzip -vv -9 <outfile>

or

$ lzma -9 -z -v <outfile>

@avaneev (Contributor) commented Jul 12, 2023

Just a note: it looks like neural-network-based compressors work great for text compression, but I do not think they'll handle "random data" compression well.

@o0101 (Contributor, Author) commented Jul 12, 2023

@rurban, would it be a bad idea to integrate a simpler-to-integrate compression library (like @avaneev's suggested LZMA) at first to figure things out, and move on to a more involved NN compression scheme later if needed?

Anyway man, I know you're super busy, so no worries if you don't get to this. Just thank you for reading this far! :)

@rurban (Owner) commented Jul 12, 2023

LZMA compresses, and therefore measures, only linear repeatability, which covers much less than even simple linear transformations, to say nothing of polynomial patterns, the simplest being multi-dimensional rotations, translations, and scalings. Only a NN (besides visual inspection) can detect such patterns properly, never a primitive LZ compressor.
I've tried zpaq and paq8px (http://mattmahoney.net/dc/) a few years ago, but better NNs are needed. Maybe https://github.com/mohit1997/Dzip-torch

@avaneev (Contributor) commented Jul 12, 2023

NN compressors are mainly language-based models. LZMA, on the other hand, works with bit-level patterns and is Markov-chain based; it's not mere "linear repeatability".
