improve the performance `company-etags` #903

redguardtoo · 2019-07-26T02:19:41Z

Per discussion #877

There is much room for improvement:

Use producer/consumer pattern, consumer only searches matches in candidate list. It should know nothing about tags file. The producer is responsible to convert tags file into candidate list (maybe in a different thread if possible)
All the tag names are assigned to different partitions, each partition only contains the string with same first character
As I mentioned before, we could use CLI program diff to get the minimum update patch for the candidate list.
More efficient APIs and 3rd party CLI tools should be used. For example, if candidate list is just a plain string containing lines, cli program sort could be used to sort lines, which might be faster than sort a Lisp list of strings.
build a cache system. Like cache for CPU. Say user inputs "get" to get all the tags starting with "get". It's very possible she still needs the same tags next time.
cache the candidates. For example, when user input test1 to search all tag names start with test1, it's very possible candidates start with tes is already returned, why not cache the tes result in a variable. So search test1 could re-use tes result.

@dgutov , if you are interested, I can send you a new pull request asap.

The text was updated successfully, but these errors were encountered:

redguardtoo · 2019-07-26T14:43:45Z

Good news. Looks (tags-completion-table), which is used by company-etags right now, is actually not inefficient at all.

Tested with Linux kernel, My new simple string match algorithm take 6 seconds (read the file and build the list) ,while tags-completion-table takes 120 seconds.

dgutov · 2019-08-09T18:00:58Z

is actually not inefficient at all

Not inefficient, or not efficient? If it's the latter, do you have a smaller patch for tags-completion-table which can be contributed to the core? I'm sure other users of this function would like to benefit as well.

I'm a bit skeptical, though, considering I've looked into its performance a couple of years ago.

redguardtoo · 2019-08-10T06:08:43Z

See my pull request, when I started coding, I changed the design a little bit.

Some guy has being used my patched version for two weeks and the feedback is very positive.

See https://emacs-china.org/t/c-emacs/10068/11 in Chrome, and use google translator to translate the page.

Here is user's translated comment,

I' used similar partition algorithm on tumashu/pyim#277 before working on company. So it's very mature. The key idea comes from mysql's partition algorithm.

dgutov · 2019-08-10T08:19:22Z

See my pull request

That's not something that can be applied to Emacs, though.

redguardtoo · 2019-08-12T03:57:51Z

See my pull request

That's not something that can be applied to Emacs, though.

I'm not sure what you mean. The point 2, 3, 6 from first post is used in final implementation. Others are not implemented. I could explain with more details if you have any questions.

dgutov · 2019-08-12T13:06:59Z

I'm not sure what you mean

A patch against Emacs core?

dgutov · 2019-08-12T13:07:54Z

You said that your "new simple string match algorithm" speeds things up by itself (aside from the other improvements). Or that's at least how I understood it.

redguardtoo · 2019-08-13T01:22:29Z

It's a pure lisp algorithm which is part of company-etags. Let me explain technical details of this algorithm.

Old code is like (all-completions "a" '("a1" "a2" "b1" "b2")). The prefix "a" need be compared with "b1", "b2" anyway.

The new code is like,

(setq prefix "a")
(cond
 ((eq (elt prefix 0) ?a)
  (all-completions prefix '("a1" "a2")))
 ((eq (elt prefix 0) ?b)
  (all-completions prefix '("b1" "b2"))))

I divide '("a1" "a2" "b1" "b2") into two partitions '("a1" "a2"), '("b1" "b2"). The all-completions only applies on one partition.

In Linux kernel, there are 3 millions tag names. I divide them into 52 partitions (a..z, A..Z). For one given prefix, only one partition is searched and the other 51 partitions are skipped. So 2,942,307 string comparing ( 3,000,000 * 51 / 52) is skipped.

That's the algorithm implemented in company-etags only.

I did say before I could speed up the code from Emacs core. But later I gave up the idea after investigation (like many of my other ideas).

dgutov · 2019-08-13T13:33:01Z

OK, so it's not string matching, it's matching with partitioning. I'm familiar with the idea.

It doesn't speed up the creation of the completion table at all, right? If anything, it might make that phase longer. It can speed subsequent completion, though.

Do you have any performance numbers, with and without partitioning?

later I gave up the idea after investigation

Because..?

redguardtoo · 2019-08-14T01:17:35Z

It doesn't speed up the creation of the completion table at all, right? If anything, it might make that phase longer. It can speed subsequent completion, though.

"speed up the creation of the completion table" is another algorithm. Let me explain the technical details.

In old code,etags-tags-completion-table is actually used to create completion table.

(defun etags-tags-completion-table () ; Doc string?
  (let (table
	(progress-reporter
	 (make-progress-reporter
	  (format "Making tags completion table for %s..." buffer-file-name)
	  (point-min) (point-max))))
    (save-excursion
      (goto-char (point-min))
      ;; This regexp matches an explicit tag name or the place where
      ;; it would start.
      (while (re-search-forward
              "[\f\t\n\r()=,; ]?\177\\\(?:\\([^\n\001]+\\)\001\\)?"
	      nil t)
	(push	(prog1 (if (match-beginning 1)
			   ;; There is an explicit tag name.
			   (buffer-substring (match-beginning 1) (match-end 1))
			 ;; No explicit tag name.  Backtrack a little,
                         ;; and look for the implicit one.
                         (goto-char (match-beginning 0))
                         (skip-chars-backward "^\f\t\n\r()=,; ")
                         (prog1
                             (buffer-substring (point) (match-beginning 0))
                           (goto-char (match-end 0))))
		  (progress-reporter-update progress-reporter (point)))
		table)))
    table))

It's slow simply because of overhead from extra layer buffer.

For example, consider goto-char which is called inside loop.

Here is its definition:

  if (MARKERP (position))
    set_point_from_marker (position);
  else if (FIXNUMP (position))
    SET_PT (clip_to_bounds (BEGV, XFIXNUM (position), ZV));
  else
    wrong_type_argument (Qinteger_or_marker_p, position);
  return position;

SET_PT is basically a macro which calls set_point, set_point is a function calls set_point_both. set_point_both is defined in intervals.c and it contains 200 lines!

In my new code, goto-char is replaced with a simple arithmetic plus.

The key algorithm is actually just 4 lines,

(while (string-match pattern text start)
  (company-etags-push-tagname (substring text (match-beginning 1) (match-end 1))
                              tagname-dict)
   (setq start (+ 4 (match-end 0))))

Then I found my code is fast enough. Besides, it's Lisp and more flexible than C. So I gave up on submitting patch to Emacs core team.

I can give your the performance number with or without partitioning later.

dgutov · 2019-08-14T08:33:53Z

The key algorithm is actually just 4 lines

I'd really like to see the performance numbers from that change alone.

redguardtoo · 2019-08-14T12:57:51Z

Code to test linux kernel (tags file is 202M):

(let* ((gc-cons-threshold most-positive-fixnum))
  (message "%S vs %S"
           (benchmark-run-compiled 1
             (setq tags-file-name "~/projs/linux-master/TAGS")
             (visit-tags-table-buffer)
             (etags-tags-completion-table))
           (benchmark-run-compiled 1
             (setq raw-content (with-temp-buffer
                                 (insert-file-contents "~/projs/linux-master/TAGS")
                                 (buffer-string)))
             (company-etags-extract-tagnames raw-content))))

Result:

Link of full test video, https://youtu.be/9t29LfYx10w

dgutov · 2019-08-14T15:41:06Z

I wonder how much of that speedup comes from the removal of progress-reporter-update, and how much of it comes from no support for implicit tag names. Or am I mistaken, and you somehow support the latter?

redguardtoo · 2019-08-15T00:31:02Z

No I don't support implicit tag names. I drop support for etags. Everybody is using Ctags these days.

dgutov · 2019-08-15T08:31:47Z

So if it's ctags, why the file name is TAGS? And should this still be called company-etags?

redguardtoo · 2019-08-15T10:35:00Z

The file name is created by ctags. Quote from ctags manual,

  -e   Output tag file for use with Emacs.
  -f <name>
       Write tags to specified file. Value of "-" writes tags to stdout
       ["tags"; or "TAGS" when -e supplied].

I think the name of etags is part of tradition, we should keep it. See https://en.wikipedia.org/wiki/Ctags#Etags

Currently Exuberant Ctags is the default standard. But Universal Ctags is gaining popularity. These two are compatible

Other tag programs are based on ctags (https://github.com/dan-t/rusty-tags , for example).

dgutov · 2019-08-15T11:29:18Z

OK, but if we're using the etags format, we have to support it fully.

Does ctags not ever output implicit tags?

Anyway, etags currently outputs better tags for Ruby. So discarding it is not an option.

redguardtoo · 2019-08-15T11:36:30Z

Could have two modes,

default is etags mode, which supports both etags and ctags. It's slower.
ctags mode, which only supports ctags. It's faster.

The problem of make etags mode default is most users don't bother switching to ctags mode even when they are using ctags only. So the best solution is to make ctags mode default and give user some warning when detecting tags file generated by etags.

The only problem is I don't know anybody using etags. Even for Ruby, https://www.google.com/search?q=ctags+ruby&oq=ctags+ruby

So maybe not worth the effort. I understand etags can produce two kinds of tag lines. Verbose kind is the format used by both etags and ctags. The compact kind (implicit tag name) is used only by etags.

redguardtoo · 2019-08-19T02:29:48Z

@dgutov , check my latest code. Now support both etags and ctags. User can also (setq company-etags-support-ctags-only t) to support ctags only.

Besides, user can customize tags file name now.

dgutov · 2019-08-19T13:26:17Z

The variable name could use some work, but ok.

How's the performance in the "slow" case?

redguardtoo · 2019-08-20T01:25:12Z

Benchmark code,

;; Test in Linux kernel code
(let* ((gc-cons-threshold most-positive-fixnum))
  (message "%S vs %S"
           (benchmark-run-compiled 1
             ;; Run `find . -name "*.[ch]" | etags -` to create tags file
             (setq raw-content (with-temp-buffer
                                 (insert-file-contents "~/projs/linux-master/TAGS.etags")
                                 (buffer-string)))
             (setq company-etags-support-ctags-only nil)
             (company-etags-extract-tagnames raw-content))
           (benchmark-run-compiled 1
             ;; Run `find . -name "*.[ch]" | ctags -e -L -` to create tags file
             (setq raw-content (with-temp-buffer
                                 (insert-file-contents "~/projs/linux-master/TAGS")
                                 (buffer-string)))
             (setq company-etags-support-ctags-only t)
             (company-etags-extract-tagnames raw-content))))

Result：
(32.571287879 0 0.0) vs (11.488479926 0 0.0)

redguardtoo · 2019-08-22T00:17:06Z

@dgutov , Anything else I need do before merging?

redguardtoo · 2019-09-07T13:46:51Z

@dgutov, I noticed you reverted my pull request. What else I need to do to get my code merged? I could re-create a new pull request if required.

dgutov · 2019-09-08T00:54:09Z

For posterity: see the linked PR.

redguardtoo mentioned this issue Jul 27, 2019

speed up company-etags #904

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve the performance `company-etags` #903

improve the performance `company-etags` #903

redguardtoo commented Jul 26, 2019 •

edited

redguardtoo commented Jul 26, 2019

dgutov commented Aug 9, 2019

redguardtoo commented Aug 10, 2019 •

edited

dgutov commented Aug 10, 2019

redguardtoo commented Aug 12, 2019

dgutov commented Aug 12, 2019

dgutov commented Aug 12, 2019 •

edited

redguardtoo commented Aug 13, 2019 •

edited

dgutov commented Aug 13, 2019

redguardtoo commented Aug 14, 2019 •

edited

dgutov commented Aug 14, 2019

redguardtoo commented Aug 14, 2019 •

edited

dgutov commented Aug 14, 2019

redguardtoo commented Aug 15, 2019 •

edited

dgutov commented Aug 15, 2019

redguardtoo commented Aug 15, 2019 •

edited

dgutov commented Aug 15, 2019

redguardtoo commented Aug 15, 2019 •

edited

redguardtoo commented Aug 19, 2019

dgutov commented Aug 19, 2019

redguardtoo commented Aug 20, 2019 •

edited

redguardtoo commented Aug 22, 2019

redguardtoo commented Sep 7, 2019 •

edited

dgutov commented Sep 8, 2019

improve the performance company-etags #903

improve the performance company-etags #903

Comments

redguardtoo commented Jul 26, 2019 • edited

redguardtoo commented Jul 26, 2019

dgutov commented Aug 9, 2019

redguardtoo commented Aug 10, 2019 • edited

dgutov commented Aug 10, 2019

redguardtoo commented Aug 12, 2019

dgutov commented Aug 12, 2019

dgutov commented Aug 12, 2019 • edited

redguardtoo commented Aug 13, 2019 • edited

dgutov commented Aug 13, 2019

redguardtoo commented Aug 14, 2019 • edited

dgutov commented Aug 14, 2019

redguardtoo commented Aug 14, 2019 • edited

dgutov commented Aug 14, 2019

redguardtoo commented Aug 15, 2019 • edited

dgutov commented Aug 15, 2019

redguardtoo commented Aug 15, 2019 • edited

dgutov commented Aug 15, 2019

redguardtoo commented Aug 15, 2019 • edited

redguardtoo commented Aug 19, 2019

dgutov commented Aug 19, 2019

redguardtoo commented Aug 20, 2019 • edited

redguardtoo commented Aug 22, 2019

redguardtoo commented Sep 7, 2019 • edited

dgutov commented Sep 8, 2019

improve the performance `company-etags` #903

improve the performance `company-etags` #903

redguardtoo commented Jul 26, 2019 •

edited

redguardtoo commented Aug 10, 2019 •

edited

dgutov commented Aug 12, 2019 •

edited

redguardtoo commented Aug 13, 2019 •

edited

redguardtoo commented Aug 14, 2019 •

edited

redguardtoo commented Aug 14, 2019 •

edited

redguardtoo commented Aug 15, 2019 •

edited

redguardtoo commented Aug 15, 2019 •

edited

redguardtoo commented Aug 15, 2019 •

edited

redguardtoo commented Aug 20, 2019 •

edited

redguardtoo commented Sep 7, 2019 •

edited