Hi all - For some reason, files keep accumulating in the WAL folder; right now I have around 65,000 files occupying almost 4.5 TB of disk. Ideally there should be no more than 10 files, right? I have also disabled WAL archiving. Why is this happening, and what am I missing?
Following is my WAL configuration:
With the WAL archive turned off, there may be more than 10 files in the
/wal folder. It doesn't act as a ring buffer in this case; segments are
simply created in the same folder, but in the latest versions their total
size should be bounded by DataStorageConfiguration.maxWalArchiveSize
(which defaults to 4x the checkpoint buffer size).
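For reference, here is a minimal sketch of capping the WAL archive size programmatically. The setter and the 4 GB value shown are illustrative; check the DataStorageConfiguration javadoc for your exact Ignite version:

```java
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalConfigSketch {
    static IgniteConfiguration configure() {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // Cap the total WAL archive size at 4 GB (illustrative value only).
        storageCfg.setMaxWalArchiveSize(4L * 1024 * 1024 * 1024);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storageCfg);
        return cfg;
    }
}
```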
Could you please tell us the exact Ignite version in which you see this issue?
Also, have you noticed whether WAL segments are ever deleted, or are they
accumulating from the very first segment?
Do you have logs for this case? How often are checkpoints created?
Hi - I am currently using Ignite version 2.7.6. The files do get deleted
whenever I restart the server, but after that they continuously stack up. One
thing I have noticed in the log files is this message: "Could not clear
historyMap due to WAL reservation on cp:". I checked the code and found that
the file is either locked or reserved (I'm not sure what "reserved" means). Is
there any reason why the WAL files would be getting locked or reserved?
Could you please share the full log? It should shed some light on the events
that happened prior to the issue.
I suspect there is a checkpoint process failure logged; you might look for
occurrences of "Failed to process checkpoint" or "Failed to find checkpoint
record at the given WAL pointer" in the logs. Another possibility that comes
to mind is a WAL reservation for rebalancing.
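Searching for those messages can be done along these lines (a throwaway sample log stands in for your real Ignite log file here, so the snippet runs on its own):

```shell
# Create a small sample log standing in for the real Ignite log file.
printf 'Failed to process checkpoint\n' > sample-ignite.log

# Look for the checkpoint failure messages mentioned above, with line numbers.
grep -n -e "Failed to process checkpoint" \
        -e "Failed to find checkpoint record at the given WAL pointer" \
        sample-ignite.log
```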
Could you briefly describe the workload under which you experience this
issue? Any baseline changes or actions on the WAL, and what data volumes are
streamed/inserted/removed into caches, using which specific API?
Hi Anton - Thanks a lot; this helps me understand the problem.
I am still trying to get the logs from production, and it might take some
time. I did see a message in the logs saying "checkpoint process failed".
What are the consequences, and how should I handle such errors? What are the
reasons I could run into this error? Yes, a node went down and rebalancing
was happening, I guess. Could this create problems, and of what sort? No
actions on baseline or WAL.
As far as data streaming is concerned, data was streamed at 80K events per
second, each event about 1 KB in size. A lot of Ignite SQL statements are
also being executed (INSERTs and UPDATEs).
One important question: if you can recall, did you start a clean cluster
with the WAL/WAL archive pointing to the very same directory, or did you
stop the node without cleaning up the WAL/PDS and then change this setting?
In other words, when this configuration was applied, were there already real
WAL segment files in the WAL folder?
1. Back up the whole work directory.
2. Remove the $IGNITE_HOME/work/wal and $IGNITE_HOME/work/db/NODE_UUID/cp
directories (replace NODE_UUID with the real node UUID, or with the
consistent ID if you have set one) and try to restart with the same
configuration.
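The two steps above can be sketched roughly as follows. The paths assume the default Ignite work directory layout, NODE_UUID stands in for your node's real consistent ID, and the demo setup at the top only exists so the snippet is self-contained:

```shell
# Demo setup: a throwaway directory mimicking Ignite's work-dir layout.
# Point IGNITE_HOME at your real installation instead.
IGNITE_HOME="/tmp/ignite-demo"
mkdir -p "$IGNITE_HOME/work/wal" "$IGNITE_HOME/work/db/NODE_UUID/cp"

# 1. Back up the whole work directory first.
cp -a "$IGNITE_HOME/work" "$IGNITE_HOME/work.bak"

# 2. Remove the WAL and checkpoint directories, then restart the node
#    with the same configuration.
rm -rf "$IGNITE_HOME/work/wal"
rm -rf "$IGNITE_HOME/work/db/NODE_UUID/cp"
```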