Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements #1

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

victor-istomin
Copy link

Hello,

I've tried this logger and found it pretty light and useful. However, I've found multi-threading concurrency bottleneck, so I'd like to improve this a bit.
Test code sample and benchmarking result will be provided below.

  1. Mutex lock is not necessary when performing line formatting and conversion to UTF-8, so I have moved string manipulation out of the locked scope. This gave more concurrency level and about 3 times more performance on Windows with my test sample. Unfortunately, this change won't improve things on Linux.
  2. Investigating at Linux I've found that application is still running "in one thread". Profiler shows spending a lot of time for mutex lock/unlock, not for disk IO. The problem is inside pthread_mutex_lock, which has too small spinlock iterations count before entering sleep state. I've added spinlock iterations manually, so code became parallel and 3 times faster in my sample in Linux too.
  3. Omitting several ANSI/UTF-16 conversions and changing 'stream << char[1]' to 'stream << char' gave me another few percent of performance. This just removes redundant strlen() call on char[1]. Additionally strftime is about 30% faster on Windows than manual stringstream conversion, and it looks simpler.

Multithreading test code

    void threadFunction()
    {
        for (int i = 0; i < 800000; ++i)
            LOG(plog::debug) << "i=" << i << "clock:" << clock();
    }

    int main(int argc, char* argv[])
    {
        plog::init(plog::debug, "Hello.txt");
        std::vector<std::thread> threads(4);

        clock_t start = clock();

        for (size_t i = 0; i < threads.size(); ++i)
            threads[i] = std::thread(threadFunction);

        for (std::thread& thread : threads)
            thread.join();

        clock_t finish = clock();

        std::cout << "Time: " << finish - start << std::endl;

        return 0;
    }

Benchmarking results for Linux

Test system is Core i5-4300M with SSD. OS: Windows 10 host, VM: Unubtu Linux 14.04.2.

Original branch

top/iotop results: Disk write speed 4-5 MB/s, CPU 25-30% total (one core at 100%).

  victor@ubuntu:~/plog$ time ./test_master_o2
  Time: 106639234

  real  0m58.393s
  user  0m4.579s
  sys   1m42.063s

Performance improvement branch

top/iotop results: Disk write speed 17-22 MB/s, CPU 100% total

  victor@ubuntu:~/plog$ time ./test_perf_o2_3
  Time: 39805657

  real  0m11.434s
  user  0m16.325s
  sys   0m23.483s

Benchmarking results for Windows are very similar.

Thanks for review and feel free to ask additional questions if needed.
Victor.

@SergiusTheBest
Copy link
Owner

Interesting findings, great thanks.

@stephane-martin
Copy link

AFAIK pthread_yield does not exist on MacOSX.

cd /usr/include/pthread
grep -r yield *
pthread.h:void pthread_yield_np(void);
sched.h:extern int sched_yield(void);

@SergiusTheBest
Copy link
Owner

@stephane-martin

Yes, the linux man page says about pthread_yield:

This call is nonstandard, but present on several other systems. Use
the standardized sched_yield instead

Thanks for noticing it.

@SergiusTheBest
Copy link
Owner

I remember about performance improvements made by @victor-istomin and I also made some additional profiling and research. The results are very sad:

  • std::mutex and pthread_mutex implementations don't use spinning by default
  • C/C++ string conversion functions are very slow

I tried to use custom string conversion functions and got a near 5x performance boost!

SergiusTheBest pushed a commit that referenced this pull request Aug 20, 2023
Mutex lock is not necessary when performing line formatting and conversion to UTF-8, so I have moved string manipulation out of the locked scope. This gave more concurrency level and about 3 times more performance on Windows with my test sample.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants