Streamer nodes (Kafka streamer as grid service - node singleton)
- 2 GB memory requested
- allowOverwrite false
- autoFlushFrequency 200 ms
- 16 consumers (64 partitions in the topic)
The streamer is configured with a stream receiver, a StreamTransformer that
handles a special case where I have to choose which record to keep.
Records are ~1.5 KB on average.
They are deserialized into domain objects, which are then streamed to the
cache as BinaryObjects.
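For context, the streamer wiring is roughly the following sketch (the cache name, the BinaryObject types, and the keep() predicate are placeholders, not my actual code):

```java
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.stream.StreamTransformer;

// ...inside the grid service, with an Ignite instance already started:

IgniteDataStreamer<String, BinaryObject> streamer =
    ignite.dataStreamer("EventCache");   // placeholder cache name

streamer.allowOverwrite(false);          // as listed above
streamer.autoFlushFrequency(200);        // ms

// The transformer receives the incoming value as args[0] and decides
// which record to keep. (Side note: the Ignite docs state that custom
// stream receivers require allowOverwrite(true) to take effect.)
streamer.receiver(StreamTransformer.from((entry, args) -> {
    BinaryObject incoming = (BinaryObject) args[0];
    BinaryObject current = entry.getValue();

    if (current == null || keep(incoming, current)) // keep(): placeholder
        entry.setValue(incoming);

    return null;
}));
```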
I started with a clean environment: no data in the cache, no data in the
WAL/storage volumes, no data in the topic.
Input data is generated at a constant rate of 1K messages per second.
For the first 20 minutes the cache size grew linearly; after that it stayed
almost flat. That's expected, since the ExpiryPolicy was set to 20 minutes.
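The flat curve also matches a quick back-of-the-envelope check: assuming expiry keeps pace with ingestion, the cache holds roughly rate × TTL entries at steady state (per-entry overhead ignored; the class below is just my arithmetic, not part of the application):

```java
public class SteadyState {
    // Steady-state entry count once TTL expiry balances the input rate.
    static long entries(long ratePerSec, long expirySec) {
        return ratePerSec * expirySec;
    }

    // Raw payload size in MB, ignoring Ignite's per-entry overhead.
    static long rawMegabytes(long entries, long avgRecordBytes) {
        return entries * avgRecordBytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        long entries = entries(1_000, 20 * 60);    // 1K msg/s, 20 min TTL
        System.out.println(entries + " entries");  // 1200000 entries
        System.out.println(rawMegabytes(entries, 1_536) + " MB raw payload"); // 1757 MB
    }
}
```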
Around the one-hour mark, consumer lag started to grow.
After that, everything went wrong.
The WAL size grew beyond the configured limit (it exactly doubled) before
Kubernetes killed the pods. Around the same moment, memory usage started to
climb toward the limit.
Throttling times and checkpoint duration stayed almost constant during the
test. The latter is really high (2 min on average), but I don't know whether
that is expected, since I have nothing to compare it against.
After 2 nodes were killed, they never joined the cluster again.
I increased the WAL volume size, but they still didn't join.
The control.sh utility lists both nodes as offline.
The logs output a message like this:
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [workerName=sys-stripe-6,
After restarting them again, one node joined the cluster but the other
didn't. The control.sh utility displayed that node as offline.
By mistake I deleted the contents of the WAL folder. Shame on me.
Now the node doesn't even start.
The node log displays:
JVM will be halted immediately due to the failure:
[failureCtx=FailureContext [type=CRITICAL_ERROR, err=class
o.a.i.i.processors.cache.persistence.StorageException: Failed to read
checkpoint record from WAL, persistence consistency cannot be guaranteed.
Make sure configuration points to correct WAL folders and WAL folder is
properly mounted [ptr=WALPointer [idx=179, fileOff=236972130, len=15006],
Which I think is expected.
Now the node is completely unusable.
Finally, my questions are:
- Can I reuse that node? Is there a way to clean its data and rejoin it to
the cluster?
- Have I lost the data on that node? It should be recoverable from backups
once I remove the node from the baseline topology, is that correct?
- If I increase the input rate to 2K, the consumer lag becomes unmanageable.
Adding more consumers won't help, since they already match the number of
topic partitions.
- 1K messages per second is really, really slow.
- How exactly does the WAL work? Why am I constantly running out of space
here?
- Any clue as to what I'm doing wrong?
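For the baseline question above, the removal step I have in mind is the standard control script call (the consistent ID is a placeholder):

```shell
# Remove the dead node from the baseline topology so its partitions are
# rebuilt from backups on the remaining nodes.
./control.sh --baseline remove <consistentId> --yes
```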