Failed to archive WAL segment

shivakumar
Failed to archive WAL segment

Hi all,

I have a 7-node Ignite cluster running on a Kubernetes platform. Each instance is configured with 64 GB of total RAM (32 GB heap space + a 12 GB default data region + the remaining 18 GB for the Ignite process), 6 CPU cores, a 12 GB disk mount for the WAL + WAL archive, and a separate 1 TB disk mount for native persistence.
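
A node configured along those lines could look roughly like the following Java sketch. The 12 GB region size and the mount paths are the ones described in this post and in the logs below; the class name is illustrative, and the FSYNC WAL mode is an assumption based on the FsyncModeFileWriteAheadLogManager entries in the stack traces.

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

public class NodeConfigSketch {
    public static void main(String[] args) {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // 12 GB persistent default data region, matching the sizing above.
        storageCfg.getDefaultDataRegionConfiguration()
            .setMaxSize(12L * 1024 * 1024 * 1024)
            .setPersistenceEnabled(true);

        // WAL + WAL archive on the 12 GB mount, persistence files on the 1 TB mount.
        storageCfg.setWalPath("/opt/ignite/wal");
        storageCfg.setWalArchivePath("/opt/ignite/wal/archive");
        storageCfg.setStoragePath("/opt/ignite/persistence");

        // Assumption: the logs mention FsyncModeFileWriteAheadLogManager,
        // which corresponds to WALMode.FSYNC.
        storageCfg.setWalMode(WALMode.FSYNC);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(storageCfg);

        Ignition.start(cfg);
    }
}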

My problem is that one of the pods (Ignite instances) went into the CrashLoopBackOff state and is not recovering from the crash:

[root@ignite-stability-controller stability]# kubectl get pods | grep ignite-server
ignite-cluster-ignite-server-0          3/3   Running            5     3d19h
ignite-cluster-ignite-server-1          3/3   Running            5     3d19h
ignite-cluster-ignite-server-2          3/3   Running            5     3d19h
ignite-cluster-ignite-server-3          3/3   Running            5     3d19h
ignite-cluster-ignite-server-4          3/3   Running            5     3d19h
ignite-cluster-ignite-server-5          3/3   Running            5     3d19h
ignite-cluster-ignite-server-6          2/3   CrashLoopBackOff   342   3d19h
ignite-server-visor-5df679d57-p4rf4     1/1   Running            0     3d19h

If I check the logs of the crashed instance, it says the following (the log lines are in a different, JSON-based format):

:"INFO","systemid":"6f058db6","system":"ignite-service-st","time":"2019-04-29T06:47:41,149Z","logger":"FsyncModeFileWriteAheadLogManager","timezone":"UTC","marker":"","log":"Starting
to copy WAL segment [absIdx=50008, segIdx=8,
origFile=/opt/ignite/wal/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000000008.wal,
dstFile=/opt/ignite/wal/archive/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000050008.wal]"}
{"type":"log","host":"ignite-cluster-ignite-server-6","level":"INFO","systemid":"6f058db6","system":"ignite-service-st","time":"2019-04-29T06:47:41,154Z","logger":"GridClusterStateProcessor","timezone":"UTC","marker":"","log":"Writing
BaselineTopology[id=1]"}
{"type":"log","host":"ignite-cluster-ignite-server-6","level":"ERROR","systemid":"6f058db6","system":"ignite-service-st","time":"2019-04-29T06:47:41,170Z","logger":"","timezone":"UTC","marker":"","log":"Critical
system error detected. Will be handled accordingly to configured handler
[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]],
failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class
o.a.i.IgniteCheckedException: Failed to archive WAL segment
[srcFile=/opt/ignite/wal/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000000008.wal,
dstFile=/opt/ignite/wal/archive/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000050008.wal.tmp]]]
class org.apache.ignite.IgniteCheckedException: Failed to archive WAL
segment
[srcFile=/opt/ignite/wal/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000000008.wal,
dstFile=/opt/ignite/wal/archive/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000050008.wal.tmp]|      
at
org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileArchiver.archiveSegment(FsyncModeFileWriteAheadLogManager.java:1826)|    
at
org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileArchiver.body(FsyncModeFileWriteAheadLogManager.java:1622)|      
at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)|
at java.lang.Thread.run(Thread.java:748)|Caused by:
java.nio.file.FileSystemException:
/opt/ignite/wal/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000000008.wal
->
/opt/ignite/wal/archive/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000050008.wal.tmp:
No space left on device|     at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)|    
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)|    
at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:253)|     at
sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)|        at
sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)|    
at java.nio.file.Files.copy(Files.java:1274)|   at
org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileArchiver.archiveSegment(FsyncModeFileWriteAheadLogManager.java:1813)|    
... 3 more"}
{"type":"log","host":"ignite-cluster-ignite-server-6","level":"WARN","systemid":"6f058db6","system":"ignite-service-st","time":"2019-04-29T06:47:41,171Z","logger":"FailureProcessor","timezone":"UTC","marker":"","log":"No
deadlocked threads detected."}

When I checked disk usage, the volume mounted for the WAL + WAL archive is full:

Filesystem      Size  Used Avail Use% Mounted on
overlay         158G  8.9G  142G   6% /
tmpfs            63G     0   63G   0% /dev
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/vda1       158G  8.9G  142G   6% /etc/hosts
tmpfs            63G   12K   63G   1% /opt/cert
shm              64M     0   64M   0% /dev/shm
/dev/vdc         12G   12G  7.1M 100% /opt/ignite/wal
/dev/vdb       1008G  110G  899G  11% /opt/ignite/persistence
tmpfs            63G  8.0K   63G   1% /etc/ignite-ssl-certs/tls.key
tmpfs            63G   12K   63G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            63G     0   63G   0% /proc/acpi
tmpfs            63G     0   63G   0% /proc/scsi
tmpfs            63G     0   63G   0% /sys/firmware


According to the Ignite documentation on the WAL archive
(https://apacheignite.readme.io/docs/write-ahead-log#section-wal-archive), the
WAL archive size is four times the checkpoint buffer size, and the checkpoint
buffer size is a function of the data region size
(https://apacheignite.readme.io/docs/durable-memory-tuning#section-checkpointing-buffer-size).
Since I have a 12 GB data region, the checkpoint buffer size defaults to 2 GB,
which means the WAL archive size should be 4 × 2 GB = 8 GB.
But I mounted a 12 GB disk volume for the WAL + WAL archive and it still filled up?
I am seeing this on only one node now, and it happened once on a few nodes in an earlier deployment.
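
If the derived defaults are the concern, both sizes can also be pinned explicitly rather than computed from the data region. A minimal sketch, assuming a recent Ignite version where setMaxWalArchiveSize is available (the class name and the 8 GB cap are illustrative, not recommendations):

import org.apache.ignite.configuration.DataStorageConfiguration;

public class WalSizingSketch {
    /** Storage config with checkpoint buffer and WAL archive sizes pinned explicitly. */
    static DataStorageConfiguration explicitWalSizing() {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // Explicit 2 GB checkpoint page buffer for the default data region
        // (the documented default for regions larger than 8 GB).
        storageCfg.getDefaultDataRegionConfiguration()
            .setCheckpointPageBufferSize(2L * 1024 * 1024 * 1024);

        // Cap the WAL archive so WAL + archive stay under the 12 GB mount.
        // setMaxWalArchiveSize exists in recent Ignite releases; older ones
        // sized the archive indirectly via walHistorySize instead.
        storageCfg.setMaxWalArchiveSize(8L * 1024 * 1024 * 1024);

        return storageCfg;
    }
}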

regards,
shiva
 



ezhuravlev

Re: Failed to archive WAL segment

Hi,

How many folders do you have in /opt/ignite/wal/? Is there a chance that you have two folders there with different node IDs? Can you share your configuration?
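
If it does turn out that restarts left extra node folders on the WAL volume, one common way to keep a pod mapped to a single folder is to set an explicit consistent ID. A minimal sketch (the helper and the ID value are illustrative):

import org.apache.ignite.configuration.IgniteConfiguration;

public class ConsistentIdSketch {
    /** Pins the node's consistent ID so it reuses the same folder across restarts. */
    static IgniteConfiguration withFixedConsistentId(IgniteConfiguration cfg) {
        // Without an explicit consistent ID, a restarted node that cannot claim
        // its previous node00-<UUID> folder may create a fresh node01-<UUID>
        // folder, leaving the old WAL data behind on the volume. The value here
        // is illustrative; in Kubernetes the StatefulSet pod name is a common choice.
        return cfg.setConsistentId("ignite-cluster-ignite-server-6");
    }
}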

Thanks,
Evgenii
