Could not clear historyMap due to WAL reservation on cp

classic Classic list List threaded Threaded
4 messages Options
shivakumar shivakumar
Reply | Threaded
Open this post in threaded view
|

Could not clear historyMap due to WAL reservation on cp

Hi
I have deployed 5 node Ignite cluster on K8S with persistence enabled
(version 2.9.0 on java 11)
I started ingesting data to 3 tables and after ingesting large amount of
data using JDBC batch insertion (around 20 million records to each of 3
tables with backup set to 1), now i connected to visor shell (from one of
the pod which i deployed just to use as visor shell) using the same ignite
config file which is used for ignite servers and after visor shell connects
to ignite cluster the unwanted wal record cleanup stopped (which should run
post checkpoint ) and WAL started growing linearly as there is continuous
data ingestion. This is making WAL disk run out of space and pods crashes.

when i see logs there are continuous warning messages saying
Could not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=e8bb9c22-0709-416f-88d6-16c5ca534024, timestamp=1606979158669,
ptr=FileWALPointer [idx=1468, fileOff=45255321, len=9857]], history map size
is 4


and if i see checkpoint finish messages
 
Checkpoint finished [cpId=ca254956-5550-45d6-87c5-892b7e07b13b,
pages=494933, markPos=FileWALPointer [idx=1472, fileOff=72673736, len=9857],
walSegmentsCleared=0, walSegmentsCovered=[1470 - 1471], markDuration=464ms,
pagesWrite=8558ms, fsync=3597ms, total=14027ms]

here you can see walSegmentsCleared=0   means there are no WAl segments
cleared even after checkpoint, not sure what is causing this behaviour.

we are ingesting very large data (~25mb/s)
please someone help in this issue


regards,
Shiva





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Alexandr Shapkin Alexandr Shapkin
Reply | Threaded
Open this post in threaded view
|

RE: Could not clear historyMap due to WAL reservation on cp

Hi,

 

Could you share the full logs with the exception?

 

There is an interesting  JIRA ticket that could be the reason for this [1]

 

[1] - https://issues.apache.org/jira/browse/IGNITE-13373

 

 

 

From: [hidden email]
Sent: Thursday, December 3, 2020 10:51 AM
To: [hidden email]
Subject: Could not clear historyMap due to WAL reservation on cp

 

Hi

I have deployed 5 node Ignite cluster on K8S with persistence enabled

(version 2.9.0 on java 11)

I started ingesting data to 3 tables and after ingesting large amount of

data using JDBC batch insertion (around 20 million records to each of 3

tables with backup set to 1), now i connected to visor shell (from one of

the pod which i deployed just to use as visor shell) using the same ignite

config file which is used for ignite servers and after visor shell connects

to ignite cluster the unwanted wal record cleanup stopped (which should run

post checkpoint ) and WAL started growing linearly as there is continuous

data ingestion. This is making WAL disk run out of space and pods crashes.

 

when i see logs there are continuous warning messages saying

Could not clear historyMap due to WAL reservation on cp: CheckpointEntry

[id=e8bb9c22-0709-416f-88d6-16c5ca534024, timestamp=1606979158669,

ptr=FileWALPointer [idx=1468, fileOff=45255321, len=9857]], history map size

is 4

 

 

and if i see checkpoint finish messages

 

Checkpoint finished [cpId=ca254956-5550-45d6-87c5-892b7e07b13b,

pages=494933, markPos=FileWALPointer [idx=1472, fileOff=72673736, len=9857],

walSegmentsCleared=0, walSegmentsCovered=[1470 - 1471], markDuration=464ms,

pagesWrite=8558ms, fsync=3597ms, total=14027ms]

 

here you can see walSegmentsCleared=0   means there are no WAl segments

cleared even after checkpoint, not sure what is causing this behaviour.

 

we are ingesting very large data (~25mb/s)

please someone help in this issue

 

 

regards,

Shiva

 

 

 

 

 

--

Sent from: http://apache-ignite-users.70518.x6.nabble.com/

 

Alex Shapkin
shivakumar shivakumar
Reply | Threaded
Open this post in threaded view
|

RE: Could not clear historyMap due to WAL reservation on cp

Thanks for your reply Alexandr Shapkin

The ticket you shared https://issues.apache.org/jira/browse/IGNITE-13373
more related to preloading
    releaseHistoryForPreloading()
But the issue am facing is somehow WAL segments which are not needed for
crash recovery after checkpoint are not getting cleaned, Do you think there
is something fixed in the above jira ticket related to my issue ?

I tried collecting logs, the logs are very huge, i couldn't find any
exceptions or error messages but I saw below warnings:
"Partition states validation has failed for group"
"Could not clear historyMap due to WAL reservation on cp"

Please find the attached logs files (not full log but part of logs where i
saw above warnings) instance1.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2244/instance1.log>  
historymap.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2244/historymap.log>  

Shiva



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
shivakumar shivakumar
Reply | Threaded
Open this post in threaded view
|

RE: Could not clear historyMap due to WAL reservation on cp

Hi Alexandr Shapkin
With master branch codebase also I'm able to re-produce this issue
(reproducible on 2.7.6, 2.9 and master)
Looks like there is a major bug in the way older WAL segment clean-up is
implemented during checkpoint.
Not sure how connecting to visor is causing this issue, any chance that
Visor puts some deadlock or lock on WAL segments ?

I grep for warning messages in server logs and i saw "history map size is"
growing like 80,81,82......
I don't know how WAL segment clean-up is implemented, do you suspect on
anything ??
There are no exceptions also in server logs.
I'm thinking to file a bug as infinitely growing WAL is a major concern.



 
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:39:35,555Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
80"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:40:27,647Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
81"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:41:19,744Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
82"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:42:12,946Z","logger":"CheckpointPagesWriterFactory","timezone":"UTC","marker":"","log":"1
checkpoint pages were not written yet due to unsuccessful page write lock
acquisition and will be retried"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:42:12,946Z","logger":"CheckpointPagesWriterFactory","timezone":"UTC","marker":"","log":"2
checkpoint pages were not written yet due to unsuccessful page write lock
acquisition and will be retried"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:42:16,872Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
83"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:43:16,064Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
84"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:44:17,383Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
85"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:44:56,140Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
86"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:45:42,978Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
87"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:46:22,415Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
88"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:47:03,778Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
89"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:47:52,676Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
90"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:48:34,504Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
91"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:50:08,610Z","logger":"CheckpointPagesWriterFactory","timezone":"UTC","marker":"","log":"1
checkpoint pages were not written yet due to unsuccessful page write lock
acquisition and will be retried"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:50:11,153Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
93"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:51:10,393Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
94"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:52:06,844Z","logger":"CheckpointPagesWriterFactory","timezone":"UTC","marker":"","log":"2
checkpoint pages were not written yet due to unsuccessful page write lock
acquisition and will be retried"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:52:10,559Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
95"}
 





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/