Failures in test_nbtree_pagesplit_cycleid #10390

Open
jcsp opened this issue Jan 14, 2025 · 8 comments
@jcsp jcsp added a/test Area: related to testing c/compute Component: compute, excluding postgres itself labels Jan 14, 2025
@MMeent MMeent added the a/test/flaky Area: related to flaky tests label Jan 15, 2025
github-merge-queue bot pushed a commit that referenced this issue Jan 16, 2025
This should fix the largest source of flakiness in
test_nbtree_pagesplit_cycleid.

## Problem

#10390

## Summary of changes

By using a guaranteed-flushed LSN, we ensure that PS won't have to wait
forever.

(If it does wait forever, we know the issue can't be with Compute's WAL)
@MMeent
Contributor

MMeent commented Jan 20, 2025

The biggest source of flakiness in this test has been reduced significantly since #10413. The remaining flakiness (about 1 failure every few days) is not yet fully understood.

@alexanderlaw

alexanderlaw commented Jan 20, 2025

I've come across a failure of this test when testing a sanitizer-enabled build on ARM:
https://neon-github-public-dev.s3.amazonaws.com/reports/branch-enable-sanitizers-for-v17/12862934608/index.html#/testresult/33599e40bfabe4ea

test_runner/regress/test_nbtree_pagesplit_cycleid.py:122: in test_nbtree_pagesplit_cycleid
    assert (
E   AssertionError: 3 page splits with cycle ID expected; actual [(2, 'cd03')]
E   assert (1 == 1 and 2 == 3)
E    +  where 1 = len([(2, 'cd03')])

I then reproduced it locally on x86_64, on iteration 1, when running 8 test instances (also with sanitizers) in parallel. I can reproduce it even after changing sleep(2) to sleep(10). Will try to investigate this.

@alexanderlaw

By the way, I wonder whether [1, pg_relation_size('t_uidx'::regclass) / 8192] here is a correct range for block numbers?

FROM generate_series(1, pg_relation_size('t_uidx'::regclass) / 8192) AS blkno,

Doesn't get_raw_page_at_lsn() return blocks numbered from 0?

As far as I can see, get_raw_page_at_lsn() doesn't check blocknum validity (e.g. it works for blocknum = 100000 or blocknum = -1), but when requesting block 10 from the t_uidx relation, I get an all-zero page, while the page returned for block 0 contains some data.
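
To make the off-by-one concern concrete, here is a minimal sketch, assuming block numbers are 0-based (as they are in PostgreSQL generally) and a hypothetical index of exactly 10 blocks. The helper names are illustrative and not part of the test.

```python
BLCKSZ = 8192  # block size used in the query above


def queried_blocks(relation_size_bytes: int) -> list[int]:
    """Block numbers produced by generate_series(1, size / 8192)."""
    nblocks = relation_size_bytes // BLCKSZ
    return list(range(1, nblocks + 1))


def existing_blocks(relation_size_bytes: int) -> list[int]:
    """Block numbers that actually exist, assuming 0-based numbering."""
    nblocks = relation_size_bytes // BLCKSZ
    return list(range(nblocks))


size = 10 * BLCKSZ  # hypothetical: a 10-block index
past_end = sorted(set(queried_blocks(size)) - set(existing_blocks(size)))
skipped = sorted(set(existing_blocks(size)) - set(queried_blocks(size)))
print("queried but nonexistent:", past_end)
print("existing but not queried:", skipped)
```

Under these assumptions the range skips block 0 and requests one block past the end of the relation, which would be consistent with block 10 coming back as an all-zero page.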

@knizhnik
Contributor

> By the way, I wonder whether [1, pg_relation_size('t_uidx'::regclass) / 8192] here is a correct range for block numbers?

Page 0 is index header.

@alexanderlaw

> Page 0 is index header.

Yes, I understand, my question was mostly about blocknum passed to get_raw_page_at_lsn(): is it 0- or 1-based...

@knizhnik
Contributor

This test checks btpo_cycleid, which is part of the nbtree page opaque structure. It is present on all nbtree pages except the header.
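
For readers unfamiliar with where btpo_cycleid lives, here is a rough sketch of pulling it out of a raw page. It assumes the PostgreSQL 14+ layout of BTPageOpaqueData (btpo_prev, btpo_next, btpo_level as 32-bit fields, then btpo_flags and btpo_cycleid as 16-bit fields), a little-endian host, and pd_special stored at byte offset 16 of the page header. The function name and the synthetic page are illustrative only; this is not how the test itself reads pages.

```python
import struct

BLCKSZ = 8192
# Assumed layout: btpo_prev, btpo_next, btpo_level (uint32 each),
# then btpo_flags, btpo_cycleid (uint16 each); little-endian assumed.
BT_OPAQUE = struct.Struct("<IIIHH")
PD_SPECIAL_OFFSET = 16  # assumed offset of pd_special in the page header


def read_cycle_id(page: bytes) -> int:
    """Return btpo_cycleid from the special space of a raw nbtree page."""
    if len(page) != BLCKSZ:
        raise ValueError("expected a full 8192-byte page")
    (pd_special,) = struct.unpack_from("<H", page, PD_SPECIAL_OFFSET)
    *_, cycle_id = BT_OPAQUE.unpack_from(page, pd_special)
    return cycle_id


# Build a synthetic page (not real index data) to exercise the parser.
page = bytearray(BLCKSZ)
special_start = BLCKSZ - BT_OPAQUE.size
struct.pack_into("<H", page, PD_SPECIAL_OFFSET, special_start)
BT_OPAQUE.pack_into(page, special_start, 0, 2, 0, 1, 0x1234)
print(f"{read_cycle_id(bytes(page)):04x}")
```

A cycle ID of zero means the page is not part of a split that the current vacuum cycle needs to care about, which is why the queries in this thread filter on non-zero values.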

@alexanderlaw

Still trying to find out what makes the test fail, but I see the following difference with the modified query:

...
    SELECT blkno,
           encode(cycle_id, 'hex')
     FROM parsed_pages
    WHERE encode(cycle_id, 'hex') != '0000';

[(1, '4d6a'), (7, '4d6a'), (8, '4d6a')]
(when the test passes)
vs

 blkno | encode 
-------+--------
     8 | 6b6f
     9 | 6b6f

when it fails.
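
To relate these per-block results to the assertion message, here is a small sketch that groups the rows by cycle ID. The (count, cycle_id) shape is inferred from the "actual [(2, 'cd03')]" output above and is an assumption about the test's query; the function names are hypothetical.

```python
from collections import Counter


def group_by_cycle_id(rows):
    """Collapse (blkno, cycle_id) rows into (page_count, cycle_id) tuples."""
    counts = Counter(cycle_id for _blkno, cycle_id in rows)
    return [(n, cid) for cid, n in sorted(counts.items())]


def meets_expectation(groups, expected_pages=3):
    """Mirror the assumed check: one cycle ID, shared by three pages."""
    return len(groups) == 1 and groups[0][0] == expected_pages


# Rows copied from the two runs quoted above.
passing = [(1, "4d6a"), (7, "4d6a"), (8, "4d6a")]
failing = [(8, "6b6f"), (9, "6b6f")]

for name, rows in (("pass", passing), ("fail", failing)):
    groups = group_by_cycle_id(rows)
    print(name, groups, meets_expectation(groups))
```

In the failing run only two pages carry the cycle ID, and block 1 is not among them.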

I've also added

@@ -1517,6 +1517,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
     origpagenumber = BufferGetBlockNumber(buf);
     /* NEON: store the page's former cycle ID for FPI check later */
     origcycleid = oopaque->btpo_cycleid;
+    elog(LOG, "!!!_bt_split| origpagenumber: %u", origpagenumber);

for debugging, and I see the following:

PG:2025-01-21 09:35:25.224 GMT [postgres][470331:1730][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 1
PG:2025-01-21 09:35:25.229 GMT [postgres][470331:2839][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 2
PG:2025-01-21 09:35:25.233 GMT [postgres][470331:3946][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 4
PG:2025-01-21 09:35:25.237 GMT [postgres][470331:5053][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 5
PG:2025-01-21 09:35:25.241 GMT [postgres][470331:6160][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 6
PG:2025-01-21 09:35:35.533 GMT [postgres][470331:8170][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 7
PG:2025-01-21 09:35:35.537 GMT [postgres][470331:9272][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 1

(when the test passes)

PG:2025-01-21 09:56:58.464 GMT [postgres][501164:1760][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 1
PG:2025-01-21 09:56:58.471 GMT [postgres][501164:2869][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 2
PG:2025-01-21 09:56:58.478 GMT [postgres][501164:3976][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 4
PG:2025-01-21 09:56:58.482 GMT [postgres][501164:5083][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 5
PG:2025-01-21 09:56:58.489 GMT [postgres][501164:6190][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 6
PG:2025-01-21 09:57:08.870 GMT [postgres][501164:8181][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 7
PG:2025-01-21 09:57:08.874 GMT [postgres][501164:9284][client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 8

(when the test fails)

So maybe the index can be split in several ways?
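
A quick way to compare the two runs is to extract the split sequence from each log and find where they diverge. The sketch below parses the debug line format added above; the helper names are hypothetical, and the two sequences are transcribed from the logs in this comment.

```python
import re

# Matches the debug line emitted by the elog() call added to _bt_split.
SPLIT_RE = re.compile(r"!!!_bt_split\| origpagenumber: (\d+)")


def split_sequence(log_text: str) -> list[int]:
    """Return the origpagenumber values in the order they were logged."""
    return [int(n) for n in SPLIT_RE.findall(log_text)]


def first_divergence(a: list[int], b: list[int]):
    """Index of the first differing element, or None if the lists match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# One line copied verbatim from the failing run, to exercise the parser.
sample = (
    "PG:2025-01-21 09:57:08.874 GMT [postgres][501164:9284]"
    "[client backend] [[unknown]] LOG:  !!!_bt_split| origpagenumber: 8"
)
print(split_sequence(sample))

# Sequences transcribed from the passing and failing logs above.
passing = [1, 2, 4, 5, 6, 7, 1]
failing = [1, 2, 4, 5, 6, 7, 8]
idx = first_divergence(passing, failing)
print(idx, passing[idx], failing[idx])
```

The runs agree on the first six splits and differ only on the last one: page 1 is split again in the passing run, while page 8 is split in the failing run.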

@ololobus
Member

@MMeent will check the remaining failures later this week
