![]() ![]() |
shivakumar |
![]() |
Hi
I have deployed 5 node Ignite cluster on K8S with persistence enabled (version 2.9.0 on java 11) I started ingesting data to 3 tables and after ingesting large amount of data using JDBC batch insertion (around 20 million records to each of 3 tables with backup set to 1), now i connected to visor shell (from one of the pod which i deployed just to use as visor shell) using the same ignite config file which is used for ignite servers and after visor shell connects to ignite cluster the unwanted wal record cleanup stopped (which should run post checkpoint ) and WAL started growing linearly as there is continuous data ingestion. This is making WAL disk run out of space and pods crashes. when i see logs there are continuous warning messages saying Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=e8bb9c22-0709-416f-88d6-16c5ca534024, timestamp=1606979158669, ptr=FileWALPointer [idx=1468, fileOff=45255321, len=9857]], history map size is 4 and if i see checkpoint finish messages Checkpoint finished [cpId=ca254956-5550-45d6-87c5-892b7e07b13b, pages=494933, markPos=FileWALPointer [idx=1472, fileOff=72673736, len=9857], walSegmentsCleared=0, walSegmentsCovered=[1470 - 1471], markDuration=464ms, pagesWrite=8558ms, fsync=3597ms, total=14027ms] here you can see walSegmentsCleared=0 means there are no WAl segments cleared even after checkpoint, not sure what is causing this behaviour. we are ingesting very large data (~25mb/s) please someone help in this issue regards, Shiva -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/ |
![]() ![]() |
Alexandr Shapkin |
![]() |
Hi, Could you share the full logs with the exception? There is an interesting JIRA ticket that could be the reason for this [1] [1] - https://issues.apache.org/jira/browse/IGNITE-13373 From: [hidden email] Hi I have deployed 5 node Ignite cluster on K8S with persistence enabled (version 2.9.0 on java 11) I started ingesting data to 3 tables and after ingesting large amount of data using JDBC batch insertion (around 20 million records to each of 3 tables with backup set to 1), now i connected to visor shell (from one of the pod which i deployed just to use as visor shell) using the same ignite config file which is used for ignite servers and after visor shell connects to ignite cluster the unwanted wal record cleanup stopped (which should run post checkpoint ) and WAL started growing linearly as there is continuous data ingestion. This is making WAL disk run out of space and pods crashes. when i see logs there are continuous warning messages saying Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=e8bb9c22-0709-416f-88d6-16c5ca534024, timestamp=1606979158669, ptr=FileWALPointer [idx=1468, fileOff=45255321, len=9857]], history map size is 4 and if i see checkpoint finish messages
Checkpoint finished [cpId=ca254956-5550-45d6-87c5-892b7e07b13b, pages=494933, markPos=FileWALPointer [idx=1472, fileOff=72673736, len=9857], walSegmentsCleared=0, walSegmentsCovered=[1470 - 1471], markDuration=464ms, pagesWrite=8558ms, fsync=3597ms, total=14027ms] here you can see walSegmentsCleared=0 means there are no WAl segments cleared even after checkpoint, not sure what is causing this behaviour. we are ingesting very large data (~25mb/s) please someone help in this issue regards, Shiva -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Alex Shapkin
|
![]() ![]() |
shivakumar |
![]() |
Thanks for your reply Alexandr Shapkin
The ticket you shared https://issues.apache.org/jira/browse/IGNITE-13373 more related to preloading releaseHistoryForPreloading() But the issue am facing is somehow WAL segments which are not needed for crash recovery after checkpoint are not getting cleaned, Do you think there is something fixed in the above jira ticket related to my issue ? I tried collecting logs, the logs are very huge, i couldn't find any exceptions or error messages but I saw below warnings: "Partition states validation has failed for group" "Could not clear historyMap due to WAL reservation on cp" Please find the attached logs files (not full log but part of logs where i saw above warnings) instance1.log <http://apache-ignite-users.70518.x6.nabble.com/file/t2244/instance1.log> historymap.log <http://apache-ignite-users.70518.x6.nabble.com/file/t2244/historymap.log> Shiva -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/ |
![]() ![]() |
shivakumar |
![]() |
Hi Alexandr Shapkin
With master branch codebase also I'm able to re-produce this issue (reproducible on 2.7.6, 2.9 and master) Looks like there is a major bug in the way older WAL segment clean-up is implemented during checkpoint. Not sure how connecting to visor is causing this issue, any chance that Visor puts some deadlock or lock on WAL segments ? I grep for warning messages in server logs and i saw "history map size is" growing like 80,81,82...... I don't know how WAL segment clean-up is implemented, do you suspect on anything ?? There are no exceptions also in server logs. I'm thinking to file a bug as infinitely growing WAL is a major concern. {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:39:35,555Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 80"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:40:27,647Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 81"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:41:19,744Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 82"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:42:12,946Z","logger":"CheckpointPagesWriterFactory","timezone":"UTC","marker":"","log":"1 checkpoint pages were not written yet due to unsuccessful page write lock acquisition and will be retried"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:42:12,946Z","logger":"CheckpointPagesWriterFactory","timezone":"UTC","marker":"","log":"2 checkpoint pages were not written yet due to unsuccessful page write lock acquisition and will be retried"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:42:16,872Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 83"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:43:16,064Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 84"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:44:17,383Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 85"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:44:56,140Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 86"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:45:42,978Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 87"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:46:22,415Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 88"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:47:03,778Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 89"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:47:52,676Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 90"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:48:34,504Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 91"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:50:08,610Z","logger":"CheckpointPagesWriterFactory","timezone":"UTC","marker":"","log":"1 checkpoint pages were not written yet due to unsuccessful page write lock acquisition and will be retried"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:50:11,153Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 93"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:51:10,393Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 94"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:52:06,844Z","logger":"CheckpointPagesWriterFactory","timezone":"UTC","marker":"","log":"2 checkpoint pages were not written yet due to unsuccessful page write lock acquisition and will be retried"} {"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:52:10,559Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could not clear historyMap due to WAL reservation on cp: CheckpointEntry [id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546, ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is 95"} -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/ |
Free forum by Nabble | Edit this page |