
Parallelize update script #115

Open · wants to merge 3 commits into gh-pages

Conversation


@dflupu commented Apr 30, 2022

This addresses ethereum/solidity#12421

It does so by adding the workerpool Node module, moving a bunch of code into a separate worker file, and running that code in the worker pool. It also adds a --max-worker-processes flag to configure the number of workers.

Edit: I initially missed a pool.terminate() call and had to do a bit of refactoring to be able to throw it in; as the code was, there was no way to know when all of the async branches were done. The last commit makes the diff quite a bit harder to read, but that was rather unavoidable: the script is a bit dated and uses a mix of callbacks and async/await.
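
For context, the worker side has to register the functions that the pool can call by name. A minimal sketch of what the new worker file might look like (the workerMain name matches the pool.exec call quoted further down; the function body is an assumption):

    // update-worker.js (sketch)
    const workerpool = require('workerpool')

    async function workerMain (dir, batch, oldList) {
      // hash and process one batch of release files,
      // returning the resulting list entries to the caller
    }

    // register the function so pool.exec('workerMain', ...) can invoke it
    workerpool.worker({ workerMain })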

@axic axic requested a review from cameel June 1, 2022 11:28

@r0qs (Member) left a comment

Hi, thank you for your contribution. I only have some comments :)

Comment on lines +14 to +15
const readFile = util.promisify(fs.readFile)
const readdir = util.promisify(fs.readdir)

Why not use fs.promises instead?

E.g.

const { readdir, readFile } = require('node:fs/promises')
// ...
const files = await readdir(path)

)
const pool = workerpool.pool('./update-worker.js', {
  maxWorkers: options.maxWorkerProcesses,
  // Threads would be ideal, but do not work as expected due to an issue with the swarmhash module

@r0qs (Member) commented Nov 3, 2022

What issue? Could you provide more information about the problem with the thread workerType?
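
For reference, switching the quoted pool configuration to worker threads (the setup being asked about here) would presumably look like this:

    const pool = workerpool.pool('./update-worker.js', {
      maxWorkers: options.maxWorkerProcesses,
      workerType: 'thread' // worker_threads instead of child processes
    })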

  pool.terminate()
}

main();

Remove the extra semicolon. Also, please remember to run npm run lint -- --fix.

Suggested change
main();
main()

const latestRelease = parsedList
  .slice()
  .reverse()
  .filter(function (listEntry) { return listEntry.prerelease === null })

@r0qs (Member) commented Nov 3, 2022

It seems that you changed the behavior of the processDir function. Some entries that do not have a prerelease now contain the prerelease field set to null in the final list.json instead of omitting the field. The prerelease field must not be set in the parsedList when there is no prerelease.

For example, this one:

    {
      "path": "soljson-v0.1.1+commit.6ff4cd6.js",
      "version": "0.1.1",
      "prerelease": null,
      "build": "commit.6ff4cd6",
      "longVersion": "0.1.1+commit.6ff4cd6",
      "keccak256": "0xd8b8c64f4e9de41e6604e6ac30274eff5b80f831f8534f0ad85ec0aff466bb25",
      "sha256": "0xfc54135edfe07d835216f87fa6c3b9049e1415c2825027b926daad018e09269f",
      "urls": [
        "bzzr://8f3c028825a1b72645f46920b67dca9432a87fc37a8940a2b2ce1dd6ddc2e29b",
        "dweb:/ipfs/QmPPGxsMtQSEUt9hn9VAtrtBTa1u9S5KF1myw78epNNFkx"
      ]
    },

should be without the prerelease field:

    {
      "path": "soljson-v0.1.1+commit.6ff4cd6.js",
      "version": "0.1.1",
      "build": "commit.6ff4cd6",
      "longVersion": "0.1.1+commit.6ff4cd6",
      "keccak256": "0xd8b8c64f4e9de41e6604e6ac30274eff5b80f831f8534f0ad85ec0aff466bb25",
      "sha256": "0xfc54135edfe07d835216f87fa6c3b9049e1415c2825027b926daad018e09269f",
      "urls": [
        "bzzr://8f3c028825a1b72645f46920b67dca9432a87fc37a8940a2b2ce1dd6ddc2e29b",
        "dweb:/ipfs/QmPPGxsMtQSEUt9hn9VAtrtBTa1u9S5KF1myw78epNNFkx"
      ]
    },
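
One way to preserve the old output shape is to attach the field only when a prerelease actually exists, and to filter on its absence. A sketch (the entry construction here is illustrative; the actual code in the script may differ):

    // sketch: build the entry without a prerelease key by default
    const entry = { path, version, build, longVersion }
    if (prerelease) {
      entry.prerelease = prerelease
    }

    // the latest-release filter then checks for absence rather than null
    const releases = parsedList.filter(function (listEntry) {
      return listEntry.prerelease === undefined
    })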

const batchPromises = []

for (let i = 0; i < values.length; i += batchSize) {
  batchPromises.push(pool.exec('workerMain', [dir, values.slice(i, i + batchSize), oldList]))
}

@r0qs (Member) commented Nov 3, 2022

Did you run any profiling to see how your changes improved the performance of the current code? The execution time of your implementation does not show any significant improvement on my machine:
Current implementation: 702.21s user 16.80s system 106% cpu 11:15.54 total
PR implementation: 670.77s user 17.92s system 104% cpu 11:01.69 total

Maybe we could also move some of the I/O operations into the workers instead of only the hash operations? For example, we could have a worker per binary directory and then multiple child workers per directory (each one reading/processing a subset of the files in that directory). The main worker for each directory would then aggregate the child workers' results and write the respective list.[json,js,txt] files, or something like that. What do you think?
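
A rough sketch of that layout (processDirectory is a hypothetical worker function, not part of this PR):

    // one pool task per binary directory; each task reads, hashes and
    // aggregates its own files, then writes that directory's list files
    const dirPromises = dirs.map(function (dir) {
      return pool.exec('processDirectory', [dir])
    })
    await Promise.all(dirPromises)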

I ran some quick profiling here, which seems to hint that your code spends most of its time waiting for file descriptors to become ready. Not sure if this is actually the case, though:
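
(The tick reports below look like output from V8's built-in profiler, presumably produced with something along these lines; the script name is assumed:)

    node --prof ./update.js
    node --prof-process isolate-*.log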

PR code:

[C++]:
   ticks  total  nonlib   name
   1987   71.3%   93.9%  epoll_pwait@@GLIBC_2.6
     21    0.8%    1.0%  __write@@GLIBC_2.2.5
     14    0.5%    0.7%  fwrite@@GLIBC_2.2.5
     12    0.4%    0.6%  recvmsg@@GLIBC_2.2.5
     11    0.4%    0.5%  std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
      9    0.3%    0.4%  std::ostream::sentry::sentry(std::ostream&)
      8    0.3%    0.4%  __libc_malloc@@GLIBC_2.2.5
      
      
[Summary]:
   ticks  total  nonlib   name
     37    1.3%    1.7%  JavaScript
   2078   74.5%   98.3%  C++
     72    2.6%    3.4%  GC
    673   24.1%          Shared libraries

Current code:

[C++]:
   ticks  total  nonlib   name
  18358    2.9%    4.9%  epoll_pwait@@GLIBC_2.6
    901    0.1%    0.2%  __dynamic_cast
    793    0.1%    0.2%  __cxxabiv1::__vmi_class_type_info::__do_dyncast(long, __cxxabiv1::__class_type_info::__sub_kind, __cxxabiv1::__class_type_info const*, void const*, __cxxabiv1::__class_type_info const*, void const*, __cxxabiv1::__class_type_info::__dyncast_result&) const
    711    0.1%    0.2%  __libc_malloc@@GLIBC_2.2.5

[Summary]:
   ticks  total  nonlib   name
   1923    0.3%    0.5%  JavaScript
  24553    3.9%    6.5%  C++
  12993    2.1%    3.5%  GC
  256700   40.5%          Shared libraries
  349903   55.3%          Unaccounted

I think it would be beneficial to have some benchmark showing the performance improvement of your implementation.
