Skip to content
This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

Wrong state of SquashedTable in SHOW PROCESSLIST #658

Open
smacker opened this issue Dec 4, 2018 · 11 comments
Open

Wrong state of SquashedTable in SHOW PROCESSLIST #658

smacker opened this issue Dec 4, 2018 · 11 comments
Labels
blocked Some other issue is blocking this bug Something isn't working

Comments

@smacker
Copy link
Contributor

smacker commented Dec 4, 2018

6639/113 doesn't look correct:

Command: query
   Time: 2373
  State: SquashedTable(ref_commits, commits, commit_files)(6639/113), blobs(0/113)
   Info: SELECT DISTINCT LANGUAGE(cf.file_path, b.blob_content) AS lang
FROM blobs b
NATURAL JOIN commit_files cf
NATURAL JOIN commits c
NATURAL JOIN ref_commits r
WHERE r.ref_name = 'HEAD'
AND r.history_index = 0
AND is_binary(b.blob_content) = false
AND cf.file_path NOT REGEXP '^vendor.*'
AND lang != ""
@ajnavarro
Copy link
Contributor

ajnavarro commented Dec 4, 2018

Could you provide more info? How did you reproduce it? how many repositories did you have? did you modify the repositories folder after launch gitbase?

@smacker
Copy link
Contributor Author

smacker commented Dec 4, 2018

I had 113 repositories. All public repos of src-d organization.

curl https://api.github.com/orgs/src-d/repos?per_page=100 > repos_list.json
mkdir -p repos
for repo in `cat repos_list.json | jq -r '.[] | .clone_url'`
do
    name=`echo ${repo%.git} | cut -d '/' -f 5`
    git clone "$repo" "repos/$name"
done

(and then repeat it with page=2)

Gitbase was restarted after cloning finished and I didn't modify repos anymore.
The query to reproduce is visible in the output of SHOW PROCESSLIST above.

I hope it would help!

@ajnavarro
Copy link
Contributor

ajnavarro commented Dec 4, 2018

which version did you use? I cannot reproduce the error with the latest 0.18.0-beta.3

@smacker
Copy link
Contributor Author

smacker commented Dec 4, 2018

$ ./gitbase version
gitbase (v0.18.0-beta.3) - build 11-27-2018_11_31_37

Easily reproducible for me. Tried with empty everything.

Use the script above to get repositories.

Then:

mysql> select count(*) from repositories;
+----------+
| COUNT(*) |
+----------+
|      100 |
+----------+
1 row in set (0.01 sec)

mysql> SELECT DISTINCT LANGUAGE(cf.file_path, b.blob_content) AS lang
    -> FROM blobs b
    -> NATURAL JOIN commit_files cf
    -> NATURAL JOIN commits c
    -> NATURAL JOIN ref_commits r
    -> WHERE r.ref_name = 'HEAD'
    -> AND r.history_index = 0
    -> AND is_binary(b.blob_content) = false
    -> AND cf.file_path NOT REGEXP '^vendor.*'
    -> AND lang != ""
    -> ;

PROCESSLIST shows 299/100.

mysql> SHOW PROCESSLIST \G;
*************************** 1. row ***************************
     Id: 3
   User: root
   Host: 127.0.0.1:52412
     db: gitbase
Command: query
   Time: 25
  State: SquashedTable(ref_commits, commits, commit_files)(299/100), blobs(0/100)
   Info: SELECT DISTINCT LANGUAGE(cf.file_path, b.blob_content) AS lang
FROM blobs b
NATURAL JOIN commit_files cf
NATURAL JOIN commits c
NATURAL JOIN ref_commits r
WHERE r.ref_name = 'HEAD'
AND r.history_index = 0
AND is_binary(b.blob_content) = false
AND cf.file_path NOT REGEXP '^vendor.*'
AND lang != ""
*************************** 2. row ***************************
     Id: 6
   User: root
   Host: 127.0.0.1:52412
     db: gitbase
Command: query
   Time: 0
  State: running
   Info: SHOW PROCESSLIST
2 rows in set (0.00 sec)

ERROR:
No query specified

After few minutes 1399/100:

*************************** 2. row ***************************
     Id: 3
   User: root
   Host: 127.0.0.1:52412
     db: gitbase
Command: query
   Time: 125
  State: SquashedTable(ref_commits, commits, commit_files)(1399/100), blobs(0/100)
   Info: SELECT DISTINCT LANGUAGE(cf.file_path, b.blob_content) AS lang
FROM blobs b
NATURAL JOIN commit_files cf
NATURAL JOIN commits c
NATURAL JOIN ref_commits r
WHERE r.ref_name = 'HEAD'
AND r.history_index = 0
AND is_binary(b.blob_content) = false
AND cf.file_path NOT REGEXP '^vendor.*'
AND lang != ""

@erizocosmico erizocosmico self-assigned this Dec 5, 2018
@erizocosmico
Copy link
Contributor

The PR that enables auto in memory joins on small tables will mitigate this to a certain extent, but I don't think this can be solved in any way as long as we do multi pass on joins. There's just no way to know ahead of time the maximum number. It's a very confusing thing, but not one we can solve to the best of my knowledge.

Perhaps we should think of a way to display progress that will not incur in this. What if instead of showing X/Y we show just the number of partitions iterated? You will not know when it will end, but it serves the purpose of knowing if it's stuck and where it's stuck more or less. If we keep it this way, 90% of the time (when it's not in-memory) it's not gonna work.

@kuba--
Copy link
Contributor

kuba-- commented Mar 21, 2019

IMO, any changing counter which can show some progress would be good enough (and maybe less confusing than proportion)

@erizocosmico erizocosmico removed their assignment Mar 27, 2019
@kuba-- kuba-- self-assigned this Apr 4, 2019
@kuba-- kuba-- transferred this issue from src-d/gitbase Apr 4, 2019
@kuba-- kuba-- added the bug Something isn't working label Apr 4, 2019
@kuba-- kuba-- closed this as completed May 24, 2019
@kuba-- kuba-- reopened this May 24, 2019
@kuba-- kuba-- removed their assignment May 24, 2019
@kuba--
Copy link
Contributor

kuba-- commented May 30, 2019

Any ideas?

@erizocosmico
Copy link
Contributor

It just can't be done if join is performed in multipass. Only solution is to not display (x/y), but only x

@ajnavarro
Copy link
Contributor

Could be possible to hide y only if join cannot be done on memory?

@erizocosmico
Copy link
Contributor

erizocosmico commented Jun 3, 2019

It starts in in-memory mode and then switches to multipass during execution, so you don't know until then. And the switch is done per join on the join, so the table has no knowledge about that.

@ajnavarro
Copy link
Contributor

I think in most of the cases having the total number of partitions is really useful, so I would rather keep it, even if sometimes can be misleading. We can add some documentation explaining what those numbers mean.

@erizocosmico erizocosmico added the blocked Some other issue is blocking this label Jun 5, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
blocked Some other issue is blocking this bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants