
Cannot create a string longer than 0x1fffffe8 characters when using data-persistence in server #554

Open
imertz opened this issue Nov 20, 2023 · 5 comments

Comments


imertz commented Nov 20, 2023

Describe the bug

When trying to persist a large amount of data using the persistToFile function, Node.js throws an error: `Cannot create a string longer than 0x1fffffe8 characters`. This is V8's maximum string length (0x1fffffe8 = 536,870,888 characters, roughly 2^29).

To Reproduce

  1. Create a large dataset (larger than the V8 string size limit).
  2. Try to persist this data using the persistToFile function.

Expected behavior

The data should be successfully persisted to the file without any errors.

Environment Info

OS: macOS
Node: 20.7.0
Orama: 2.0.0 beta7

Affected areas

Data Insertion

Additional context

Possible Solution:

Consider implementing a streaming approach to write the data to the file, which would avoid having to convert the entire Buffer to a string at once.

micheleriva (Member) commented

Thanks for opening this. @allevo, we should rework the persistence plugin if we can reproduce this.

allevo (Collaborator) commented Dec 1, 2023

Hi @imertz !
Have you tried with a different format? For instance https://docs.oramasearch.com/open-source/plugins/plugin-data-persistence#persisting-the-database-to-disk-server-usage . Will that fit your case?

imertz (Author) commented Dec 6, 2023

> Hi @imertz ! Have you tried with a different format? For instance https://docs.oramasearch.com/open-source/plugins/plugin-data-persistence#persisting-the-database-to-disk-server-usage . Will that fit your case?

I'll try it out and come back to you.

valstu (Contributor) commented Feb 6, 2024

IIRC dpack worked for persisting the file to disk, but if the file is larger than 512 MB, restoreFromFile won't work. This is mainly because all of the restore implementations rely on the toString() method at some point, which means they try to create a string over 512 MB. So when writing to or restoring from a file, the data should be read with fs.createReadStream and written with fs.createWriteStream.

Here's a naive implementation with streaming support for Node.js using @msgpack/msgpack (basically the current binary-format solution with streaming support):

```typescript
import type { AnyOrama, RawData } from '@orama/orama';
import { create, load, save } from '@orama/orama';
import fs from 'fs';
import { decode, encode } from '@msgpack/msgpack';

export const persistToFile = async (
  db: AnyOrama,
  outputFile: string,
) => {
  const dbExport = await save(db);
  const msgpack = encode(dbExport);
  const serialized = Buffer.from(
    msgpack.buffer,
    msgpack.byteOffset,
    msgpack.byteLength,
  );

  const writeStream = fs.createWriteStream(outputFile);
  const chunkSize = 1024;
  for (let i = 0; i < serialized.length; i += chunkSize) {
    const end = Math.min(i + chunkSize, serialized.length);
    const chunk = serialized.subarray(i, end);
    // Hex-encode each chunk individually so toString() is never called
    // on the whole buffer.
    writeStream.write(chunk.toString('hex'));
  }
  writeStream.end();

  // Resolve only once the data has actually been flushed to disk.
  await new Promise<void>((resolve, reject) => {
    writeStream.on('finish', resolve);
    writeStream.on('error', reject);
  });
};

const deserialize = async (inputFile: string) => {
  return new Promise<RawData>((resolve, reject) => {
    const readStream = fs.createReadStream(inputFile, {
      encoding: 'utf8',
      // highWaterMark: 1024,
    });
    const chunks: Buffer[] = [];
    readStream.on('data', (chunk: string) => {
      chunks.push(Buffer.from(chunk, 'hex'));
    });

    readStream.on('end', () => {
      const combinedBuffer = Buffer.concat(chunks);
      resolve(decode(combinedBuffer) as RawData);
    });
    readStream.on('error', reject);
  });
};

export const restoreFromFile = async (inputFile: string) => {
  const deserialized = await deserialize(inputFile);
  const db = await create({
    schema: {
      __placeholder: 'string',
    },
  });
  await load(db, deserialized);
  return db;
};
```

Disclaimer: I extracted these functions from a larger codebase, so I haven't actually run this exact piece of code, but hopefully this helps. Also, I'm not sure if the chunking part of the persistToFile function is the way to go, but it worked for me.

valstu (Contributor) commented Feb 12, 2024

We also noticed that you can write the msgpack-encoded binary directly to the file instead of turning it into hex before writing. This makes the msp file half the size.
