
sql: Performance problems with queries to pg_catalog #57249

Closed
gyrter opened this issue Nov 30, 2020 · 8 comments · Fixed by #57542
Labels: C-bug, O-community, release-blocker, X-blathers-triaged

Comments


gyrter commented Nov 30, 2020

Describe the problem

Hello. I migrated from CockroachDB 20.1.8 to 20.2.2 and ran into problems with application startup. Our ORM, via the PostgreSQL JDBC driver, issues queries to pg_catalog on startup to determine types and relations, and these queries have become very slow. We investigated further and found that the response time correlates with the number of tables in the database. I also found that one node in the cluster still performs well.

To Reproduce

  1. Set up a CockroachDB cluster in k8s.
  2. Create a database with 10 tables.
  3. Got this response time: I201130 10:52:27.993894 5920071 sql/exec_log.go:225 ⋮ [n3,client=‹10.111.7.3:38470›,hostssl,user=‹db1›] 3 112.396ms ‹exec› ‹"PostgreSQL JDBC Driver"› ‹{}› ‹"SELECT typinput = 'array_in'::REGPROC AS is_array, typtype, typname FROM pg_catalog.pg_type LEFT JOIN (SELECT ns.oid AS nspoid, ns.nspname, r.r FROM pg_namespace AS ns JOIN (SELECT s.r, (current_schemas(false))[s.r] AS nspname FROM ROWS FROM (generate_series(1, array_upper(current_schemas(false), 1))) AS s (r)) AS r USING (nspname)) AS sp ON sp.nspoid = typnamespace WHERE typname = $1 ORDER BY sp.r, pg_type.oid DESC"› ‹{$1:"'jsonb'"}› 1 ‹""› 0 ‹{ LATENCY_THRESHOLD }›
  4. Add another database with 93 tables.
  5. Got this response time: I201130 10:36:00.786376 5085793 sql/exec_log.go:225 ⋮ [n2,client=‹192.168.114.18:21850›,hostssl,user=‹db1›] 67 520.064ms ‹exec› ‹"PostgreSQL JDBC Driver"› ‹{}› ‹"SELECT typinput = 'array_in'::REGPROC AS is_array, typtype, typname FROM pg_catalog.pg_type LEFT JOIN (SELECT ns.oid AS nspoid, ns.nspname, r.r FROM pg_namespace AS ns JOIN (SELECT s.r, (current_schemas(false))[s.r] AS nspname FROM ROWS FROM (generate_series(1, array_upper(current_schemas(false), 1))) AS s (r)) AS r USING (nspname)) AS sp ON sp.nspoid = typnamespace WHERE typname = $1 ORDER BY sp.r, pg_type.oid DESC"› ‹{$1:"'jsonb'"}› 1 ‹""› 0 ‹{ LATENCY_THRESHOLD }›
  6. We also created more tables and the query became slower (one way to bulk-create such tables is sketched right after this list).
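
For reference, a minimal sketch of how dummy tables like these can be bulk-created for reproduction. The t_N names and columns below are placeholders, not our real schema; the generated statements can be piped back into the SQL shell (for example cockroach sql or psql) to actually create the tables:

-- Emit CREATE TABLE statements for 93 throwaway tables; feed the output
-- back into the SQL shell to create them.
SELECT 'CREATE TABLE IF NOT EXISTS t_' || i::STRING || ' (id INT PRIMARY KEY, v STRING);'
  FROM generate_series(1, 93) AS g(i);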

Expected behavior

I expected better query times for catalog queries.

Environment:

  • CockroachDB 20.2.2
  • Kubernetes 1.16.15
  • Ubuntu 18.04.5
  • Linux 5.3.0-51-generic

blathers-crl bot commented Nov 30, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl bot added the C-bug, O-community, and X-blathers-triaged labels Nov 30, 2020
rafiss added this to Triage in SQL Sessions - Deprecated via automation Nov 30, 2020

rafiss commented Nov 30, 2020

Thanks for the report!

I haven't been able to reproduce this locally. I created a local single-node v20.2.2 cluster with 10 tables, then added 100 more tables in a different database. In both cases, the query took under 20ms.

root@:26257/defaultdb> SELECT typinput = 'array_in'::REGPROC AS is_array, typtype, typname FROM pg_catalog.pg_type LEFT JOIN (SELECT ns.oid AS nspoid, ns.nspname, r.r FROM pg_namespace AS ns JOIN (SELECT s.r, (current_schemas(false))[s.r] AS nspname FROM ROWS FROM (generate_series(1, array_upper(current_schemas(false), 1))) AS s (r)) AS r USING (nspname)) AS sp ON sp.nspoid = typnamespace WHERE typname = 'jsonb' ORDER BY sp.r, pg_type.oid DESC;
  is_array | typtype | typname
-----------+---------+----------
   false   | b       | jsonb
(1 row)

Time: 17ms total (execution 17ms / network 0ms)

A couple of questions for you, @gyrter:

  • Can you confirm that you saw this with v20.2.2? Versions 20.2.0 and 20.2.1 did have an issue with the performance of pg_catalog queries, but PR release-20.2: opt: add cost for table descriptor fetch during virtual scan #56349 made it into v20.2.2 and addressed that issue. So I just want to confirm that you are running a version that includes that PR (a quick way to check the running version is noted after this list).
  • I've only tried running the query from the command line. Can you reproduce the problem by running the query from the command line? Or does it only happen when running from the Java app, which I see also uses a prepared statement?
  • Can you run the following and share the output?
EXPLAIN SELECT typinput = 'array_in'::REGPROC AS is_array, typtype, typname FROM pg_catalog.pg_type LEFT JOIN (SELECT ns.oid AS nspoid, ns.nspname, r.r FROM pg_namespace AS ns JOIN (SELECT s.r, (current_schemas(false))[s.r] AS nspname FROM ROWS FROM (generate_series(1, array_upper(current_schemas(false), 1))) AS s (r)) AS r USING (nspname)) AS sp ON sp.nspoid = typnamespace WHERE typname = 'jsonb' ORDER BY sp.r, pg_type.oid DESC;
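
On the first question above, a quick way to confirm the exact version a node is running (just as a convenience; either of these should work):

-- from any SQL session
SELECT version();
-- or, on the node itself, the CLI equivalent:
-- cockroach version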


gyrter commented Dec 1, 2020

Thank you for the fast response. I will try to help you with this issue.

  • I tested JDBC on 20.2.0, 20.2.1, and 20.2.2 - all versions were affected.
  • Yes, of course. I found an even more interesting query:
    SELECT tmp.table_cat, tmp.table_schem, tmp.table_name, tmp.non_unique, tmp.index_qualifier, tmp.index_name, tmp.type, tmp.ordinal_position, btrim(pg_catalog.pg_get_indexdef(tmp.ci_oid, tmp.ordinal_position, false), '\"') AS column_name, CASE tmp.am_canorder WHEN true THEN CASE tmp.i_indoption[tmp.ordinal_position - 1] & 1 WHEN 1 THEN 'D' ELSE 'A' END ELSE NULL END AS asc_or_desc, tmp.cardinality, tmp.pages, tmp.filter_condition FROM (SELECT NULL AS table_cat, n.nspname AS table_schem, ct.relname AS table_name, NOT i.indisunique AS non_unique, NULL AS index_qualifier, ci.relname AS index_name, CASE i.indisclustered WHEN true THEN 1 ELSE CASE am.amname WHEN 'hash' THEN 2 ELSE 3 END END AS type, (information_schema._pg_expandarray(i.indkey)).n AS ordinal_position, ci.reltuples AS cardinality, ci.relpages AS pages, pg_catalog.pg_get_expr(i.indpred, i.indrelid) AS filter_condition, ci.oid AS ci_oid, i.indoption AS i_indoption, am.amcanorder AS am_canorder FROM pg_catalog.pg_class AS ct JOIN pg_catalog.pg_namespace AS n ON (ct.relnamespace = n.oid) JOIN pg_catalog.pg_index AS i ON (ct.oid = i.indrelid) JOIN pg_catalog.pg_class AS ci ON (ci.oid = i.indexrelid) JOIN pg_catalog.pg_am AS am ON (ci.relam = am.oid) WHERE (true AND (n.nspname = 'public')) AND (ct.relname = 'sometable')) AS tmp ORDER BY non_unique, type, index_name, ordinal_position.
    Here are my tests:
~ # time psql -U admin -h v20_2_2 -d db1 -p26257 -f problem_query.sql
...
real	0m9.395s
user	0m0.056s
sys	0m0.009s
~ # time psql -U admin -h v20_1_8  -d db1 -p26257 -f problem_query.sql
...
real	0m1.437s
user	0m0.053s
sys	0m0.014s

Also, this query gets different query plans in different versions of CockroachDB:

  • v20.2.2
                            tree                            |    field     |                 description                  
------------------------------------------------------------+--------------+----------------------------------------------
                                                            | distribution | local
                                                            | vectorized   | false
 sort                                                       |              | 
  │                                                         | order        | +non_unique,+type,+relname,+ordinal_position
  └── render                                                |              | 
       └── render                                           |              | 
            └── project set                                 |              | 
                 └── hash join                              |              | 
                      │                                     | equality     | (relam) = (oid)
                      ├── hash join                         |              | 
                      │    │                                | equality     | (oid) = (indexrelid)
                      │    ├── virtual table                |              | 
                      │    │                                | table        | pg_class@primary
                      │    └── hash join                    |              | 
                      │         │                           | equality     | (indrelid) = (oid)
                      │         ├── virtual table           |              | 
                      │         │                           | table        | pg_index@primary
                      │         └── hash join               |              | 
                      │              │                      | equality     | (relnamespace) = (oid)
                      │              ├── filter             |              | 
                      │              │    │                 | filter       | relname = 'sometable'
                      │              │    └── virtual table |              | 
                      │              │                      | table        | pg_class@primary
                      │              └── filter             |              | 
                      │                   │                 | filter       | nspname = 'public'
                      │                   └── virtual table |              | 
                      │                                     | table        | pg_namespace@primary
                      └── virtual table                     |              | 
                                                            | table        | pg_am@primary
  • v20.1.8
                              tree                               |    field    |                 description                  
-----------------------------------------------------------------+-------------+----------------------------------------------
                                                                 | distributed | false
                                                                 | vectorized  | false
 render                                                          |             | 
  └── sort                                                       |             | 
       │                                                         | order       | +non_unique,+type,+relname,+ordinal_position
       └── render                                                |             | 
            └── render                                           |             | 
                 └── project set                                 |             | 
                      └── hash-join                              |             | 
                           │                                     | type        | inner
                           │                                     | equality    | (oid) = (relam)
                           ├── virtual table                     |             | 
                           │                                     | source      | 
                           └── hash-join                         |             | 
                                │                                | type        | inner
                                │                                | equality    | (oid) = (indexrelid)
                                ├── virtual table                |             | 
                                │                                | source      | 
                                └── hash-join                    |             | 
                                     │                           | type        | inner
                                     │                           | equality    | (indrelid) = (oid)
                                     ├── virtual table           |             | 
                                     │                           | source      | 
                                     └── hash-join               |             | 
                                          │                      | type        | inner
                                          │                      | equality    | (relnamespace) = (oid)
                                          ├── filter             |             | 
                                          │    │                 | filter      | relname = 'sometable'
                                          │    └── virtual table |             | 
                                          │                      | source      | 
                                          └── filter             |             | 
                                               │                 | filter      | nspname = 'public'
                                               └── virtual table |             | 
                                                                 | source      | 

The main difference: this query has a join on pg_am.

  • Output for EXPLAIN SELECT typinput = 'array_in'::REGPROC AS is_array, typtype, typname FROM pg_catalog.pg_type LEFT JOIN (SELECT ns.oid AS nspoid, ns.nspname, r.r FROM pg_namespace AS ns JOIN (SELECT s.r, (current_schemas(false))[s.r] AS nspname FROM ROWS FROM (generate_series(1, array_upper(current_schemas(false), 1))) AS s (r)) AS r USING (nspname)) AS sp ON sp.nspoid = typnamespace WHERE typname = 'jsonb' ORDER BY sp.r, pg_type.oid DESC;
~ # time psql -U admin -h v20_2_2 -d db1 -p26257 -f test_query.sql 
Password for user admin: 
                  tree                  |    field     |      description       
----------------------------------------+--------------+------------------------
                                        | distribution | local
                                        | vectorized   | false
 sort                                   |              | 
  │                                     | order        | +generate_series,-oid
  └── render                            |              | 
       └── hash join (right outer)      |              | 
            │                           | equality     | (oid) = (typnamespace)
            ├── hash join               |              | 
            │    │                      | equality     | (nspname) = (nspname)
            │    ├── virtual table      |              | 
            │    │                      | table        | pg_namespace@primary
            │    └── render             |              | 
            │         └── project set   |              | 
            │              └── emptyrow |              | 
            └── filter                  |              | 
                 │                      | filter       | typname = 'jsonb'
                 └── virtual table      |              | 
                                        | table        | pg_type@primary
(18 rows)


real	0m5.982s
user	0m0.052s
sys	0m0.016s
 ~ # time psql -U admin -h v20_1_8 -d db1 -p26257 -f test_query.sql 
Password for user admin: 
                    tree                     |    field    |      description       
---------------------------------------------+-------------+------------------------
                                             | distributed | false
                                             | vectorized  | false
 render                                      |             | 
  └── sort                                   |             | 
       │                                     | order       | +generate_series,-oid
       └── render                            |             | 
            └── hash-join                    |             | 
                 │                           | type        | right outer
                 │                           | equality    | (oid) = (typnamespace)
                 ├── hash-join               |             | 
                 │    │                      | type        | inner
                 │    │                      | equality    | (nspname) = (nspname)
                 │    ├── virtual table      |             | 
                 │    │                      | source      | 
                 │    └── render             |             | 
                 │         └── project set   |             | 
                 │              └── emptyrow |             | 
                 └── filter                  |             | 
                      │                      | filter      | typname = 'jsonb'
                      └── virtual table      |             | 
                                             | source      | 
(21 rows)


real	0m0.876s
user	0m0.052s
sys	0m0.018s

All DB instances run on the same nodes in k8s.


z0mb1ek commented Dec 1, 2020

Hi. We use Hibernate too, and I see this performance issue during schema validation:
[screenshot: Screenshot 2020-12-01 at 11 58 26]

I also see this in CockroachDB Cloud, and have opened support request https://support.cockroachlabs.com/hc/en-us/requests/6880

Adding diagnostics details:
stmt-bundle-611918818502377474.zip

rafiss moved this from Triage to Shorter term backlog in SQL Sessions - Deprecated Dec 1, 2020

rafiss commented Dec 1, 2020

I created a remote 3-node cluster running v20.2.2, and added 100 empty tables into it.

Running SELECT tmp.table_cat, tmp.table_schem, tmp.table_name, tmp.non_unique, tmp.index_qualifier, tmp.index_name, tmp.type, tmp.ordinal_position, btrim(pg_catalog.pg_get_indexdef(tmp.ci_oid, tmp.ordinal_position, false), '\"') AS column_name, CASE tmp.am_canorder WHEN true THEN CASE tmp.i_indoption[tmp.ordinal_position - 1] & 1 WHEN 1 THEN 'D' ELSE 'A' END ELSE NULL END AS asc_or_desc, tmp.cardinality, tmp.pages, tmp.filter_condition FROM (SELECT NULL AS table_cat, n.nspname AS table_schem, ct.relname AS table_name, NOT i.indisunique AS non_unique, NULL AS index_qualifier, ci.relname AS index_name, CASE i.indisclustered WHEN true THEN 1 ELSE CASE am.amname WHEN 'hash' THEN 2 ELSE 3 END END AS type, (information_schema._pg_expandarray(i.indkey)).n AS ordinal_position, ci.reltuples AS cardinality, ci.relpages AS pages, pg_catalog.pg_get_expr(i.indpred, i.indrelid) AS filter_condition, ci.oid AS ci_oid, i.indoption AS i_indoption, am.amcanorder AS am_canorder FROM pg_catalog.pg_class AS ct JOIN pg_catalog.pg_namespace AS n ON (ct.relnamespace = n.oid) JOIN pg_catalog.pg_index AS i ON (ct.oid = i.indrelid) JOIN pg_catalog.pg_class AS ci ON (ci.oid = i.indexrelid) JOIN pg_catalog.pg_am AS am ON (ci.relam = am.oid) WHERE (true AND (n.nspname = 'public')) AND (ct.relname = 'sometable')) AS tmp ORDER BY non_unique, type, index_name, ordinal_position

against it still only takes 45ms. So unfortunately I still have not been able to reproduce the slowness.

Thanks for sharing those explain plans. I think the plans actually are the same, though maybe the output format changed. I see both include the join on relam:

                      └── hash-join                              |             | 
                           │                                     | type        | inner
                           │                                     | equality    | (oid) = (relam)

The statement bundle you provided should be helpful. We will look into that next.
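
For anyone else following along: a bundle like that can also be collected by prefixing the slow statement with EXPLAIN ANALYZE (DEBUG), or by activating diagnostics for the statement on the Statements page of the Admin UI. For example, with a shortened stand-in query:

-- Produces a downloadable statement diagnostics bundle (plan, trace, stats).
EXPLAIN ANALYZE (DEBUG) SELECT typname FROM pg_catalog.pg_type WHERE typname = 'jsonb';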


z0mb1ek commented Dec 2, 2020

@rafiss Hi, steps to reproduce:

  1. create database test1
  2. run IMPORT PGDUMP 'https://drive.google.com/uc?export=download&id=1USsFE_LypdaAMCRNgExaiL5L4wvLmbP4'
  3. create database test2
  4. start a Java application that creates 10 simple tables (I think it doesn't matter whether they are created by hand or automatically)
  5. restart the Java app and see that it takes 1 minute to start
  6. see in the CockroachDB cluster panel that service latency for this start is 6s
  7. see the slow statements in the CockroachDB cluster panel

I also see the slowness in TablePlus.

Adding more bundles for you:

stmt-bundle-612213293907738625.zip
stmt-bundle-612213287988232193.zip
stmt-bundle-612212071640268801.zip
stmt-bundle-612212065837842433.zip
stmt-bundle-612212028882747393.zip
stmt-bundle-612212014046117889.zip
stmt-bundle-612212034723119105.zip
stmt-bundle-612212020206993409.zip


rafiss commented Dec 3, 2020

Thanks for these steps. I can reproduce more of the slowness now. I think the existence of more foreign keys makes these queries slower, so using the pgdump you provided was helpful.

I believe the underlying cause is new validation logic in v20.2.x that is performed when fetching table descriptors. We fetch these descriptors every time any data from a pg_catalog table is read.

We'll work on improving the performance and include the improvement in a patch release of 20.2 as soon as we can.
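
If that is the underlying cause, then even a trivial read from any pg_catalog table should show the same scaling with the number of tables in the cluster. A minimal way to check (this is a sketch, not a query from the reports above, and absolute timings will vary by environment):

-- Even this trivial virtual-table read pays the descriptor fetch/validation
-- cost described above, so its latency should grow with the table count.
SELECT count(*) FROM pg_catalog.pg_type;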

jordanlewis added this to Triage in SQL Foundations via automation Dec 3, 2020
jordanlewis removed this from Shorter term backlog in SQL Sessions - Deprecated Dec 3, 2020
jordanlewis moved this from Triage to 21.1 December in SQL Foundations Dec 3, 2020
jordanlewis added the release-blocker label Dec 3, 2020

blathers-crl bot commented Dec 3, 2020

Hi @jordanlewis, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Dec 4, 2020
In 20.2 we added logic to validate descriptors on the read path for getting
all descriptors and introduced a performance regression. In particular, we'd
now retrieve all databases and cross references for all tables. This stricter
validation has proven to be somewhat handy and has become utilized in some
tests to ensure that descriptors indeed are valid. Eliminating this validation
would thus be problematic. This is fortunately easy to resolve as we already
have all of the descriptors sitting in memory. With this change, we use them.

Fixes cockroachdb#57249.

In the previously added benchmark tracking the number of KV calls, the code
used to perform 32 reads and now perform 1 (the 32 mostly has to do with
resolving the system database when validating system tables).

Release note (bug fix): Fixed performance regression introduced in v20.2 to
reading virtual tables which introspect the schema.
craig bot pushed a commit that referenced this issue Dec 4, 2020
57231: roachtest: skip jobs/mixed-version r=dt a=pbardea

This test runs import in a mixed-version setting, which is known to be
not supported. This test is failing often enough that we're not getting
good signal from it.

Fixes #56038.

Release note: None

57382: importccl: support IMPORT in mixed-version cluster r=dt a=pbardea

This commit adds support for running IMPORT in a mixed-version cluster.
Note, that it does not add support for running an IMPORT during the
upgrade of a node however. The IMPORT job would need to be restarted in
that case.

Release note (sql change): Add support for running IMPORT in a
mixed-version cluster.

57542: catalogkv: use in-memory descriptors to validate in GetAllDescriptors r=lucy-zhang a=ajwerner

In 20.2 we added logic to validate descriptors on the read path for getting
all descriptors and introduced a performance regression. In particular, we'd
now retrieve all databases and cross references for all tables. This stricter
validation has proven to be somewhat handy and has become utilized in some
tests to ensure that descriptors indeed are valid. Eliminating this validation
would thus be problematic. This is fortunately easy to resolve as we already
have all of the descriptors sitting in memory. With this change, we use them.
    
Fixes #57249.
    
In the previously added benchmark tracking the number of KV calls, the code
used to perform 32 reads and now perform 1 (the 32 mostly has to do with
resolving the system database when validating system tables).
    
Release note (bug fix): Fixed performance regression introduced in v20.2 to
reading virtual tables which introspect the schema.


Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
craig bot closed this as completed in 4e8a985 Dec 4, 2020
SQL Foundations automation moved this from 21.1 December to Done Dec 4, 2020