Node failure with "Failed to write buffer." error

ihalilaltun

Hi folks,

We have recently been experiencing node failures with the error "Failed to
write buffer.". Any ideas or tuning suggestions to avoid this error and the
resulting node failure?

Thanks...

[2019-08-22T01:20:55,916][ERROR][wal-write-worker%null-#221][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Failed to write buffer.]]
org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to write buffer.
        at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3484) [ignite-core-2.7.5.jar:2.7.5]
        at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3301) [ignite-core-2.7.5.jar:2.7.5]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.5.jar:2.7.5]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
Caused by: java.nio.channels.ClosedChannelException
        at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) ~[?:1.8.0_201]
        at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) ~[?:1.8.0_201]
        at org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48) ~[ignite-core-2.7.5.jar:2.7.5]
        at org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41) ~[ignite-core-2.7.5.jar:2.7.5]
        at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111) ~[ignite-core-2.7.5.jar:2.7.5]
        at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3477) ~[ignite-core-2.7.5.jar:2.7.5]
        ... 3 more
[2019-08-22T01:20:55,921][WARN ][wal-write-worker%null-#221][FailureProcessor] No deadlocked threads detected.
[2019-08-22T01:20:56,347][WARN ][wal-write-worker%null-#221][FailureProcessor] Thread dump at 2019/08/22 01:20:56 UTC


*Ignite version*: 2.7.5
*Cluster size*: 16
*Client size*: 22
*Cluster OS version*: Centos 7
*Cluster Kernel version*: 4.4.185-1.el7.elrepo.x86_64
*Java version* :
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)

Current disk sizes:
Screen_Shot_2019-08-22_at_12.png
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/Screen_Shot_2019-08-22_at_12.png>
Ignite and GC logs:
ignite-9.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/ignite-9.zip>
Ignite configuration file:
default-config.xml
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/default-config.xml>



-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
dmagda

Ivan, Alex Goncharuk,

The exception trace is not helpful: it is not obvious what the cause might be or how to address it. How do we tackle problems like this?

Ibrahim, please attach all the log files for a detailed look.

-
Denis


ihalilaltun

Hi Dmagda,

Here are all the log files that I could get from the server:
ignite.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/ignite.zip>  
gc.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t2515/gc.zip>  
gc-logs-continnued
<https://drive.google.com/file/d/1LcEJG7FyrCoVcm8mZiesV4ScQd86lVcj/view?usp=sharing>  



mmuzaf

Hello,

Did you change the IGNITE_WAL_MMAP system property (true by default)?
Can you also attach your Ignite configuration file?

I've checked the log you provided, and it seems to me that during the
WAL file rollover the current WAL file is closed, but the WALWriter
thread corresponding to this file is not stopped. Usually this works
fine; I don't know why it does not work in your case.
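
To illustrate the failure mode described above, here is a minimal sketch in plain Java NIO. It is not Ignite code: the class name, method name, and temp-file name are made up for the example. It only shows that calling position() on a FileChannel after the channel has been closed raises the same ClosedChannelException that appears as the "Caused by" in the trace, which is what a writer thread would see if the segment file were closed underneath it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; not part of Apache Ignite.
public class ClosedSegmentDemo {

    /** Returns true if position() on a closed channel throws ClosedChannelException. */
    static boolean positionAfterCloseFails() {
        Path segment = null;
        try {
            // Stand-in for a WAL segment file (name is an assumption).
            segment = Files.createTempFile("wal-segment", ".tmp");
            FileChannel ch = FileChannel.open(segment, StandardOpenOption.WRITE);
            ch.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));

            // Simulates the segment being closed while a writer still holds it.
            ch.close();

            try {
                // The trace fails in FileChannelImpl.position -> ensureOpen.
                ch.position();
                return false;
            } catch (ClosedChannelException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (segment != null) {
                try {
                    Files.deleteIfExists(segment);
                } catch (IOException ignored) {
                    // Best-effort cleanup of the temp file.
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("position() after close failed: " + positionAfterCloseFails());
    }
}
```

The sketch says nothing about why the writer was still active in this cluster; it only reproduces the low-level exception.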


ihalilaltun

Hi Mmuzaf

IGNITE_WAL_MMAP is false in our environment.

Here is the configuration:
<?xml version="1.0" encoding="UTF-8"?>

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd">
    <bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <constructor-arg type="java.lang.String" value="/etc/apache-ignite/ignite-log4j2.xml"/>
            </bean>
        </property>
        <property name="communicationSpi">
            <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
                <property name="usePairedConnections" value="true"/>
            </bean>
        </property>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="systemThreadPoolSize" value="128"/>
        <property name="publicThreadPoolSize" value="128"/>
        <property name="queryThreadPoolSize" value="128"/>
        <property name="serviceThreadPoolSize" value="128"/>
        <property name="stripedPoolSize" value="128"/>
        <property name="dataStreamerThreadPoolSize" value="64"/>
        <property name="rebalanceThreadPoolSize" value="8"/>

        <property name="peerClassLoadingEnabled" value="true"/>

        <property name="cacheConfiguration">
            <list>
                <bean class="org.apache.ignite.configuration.CacheConfiguration">
                    <property name="name" value="default"/>
                    <property name="atomicityMode" value="ATOMIC"/>
                    <property name="backups" value="1"/>
                </bean>
            </list>
        </property>

        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="networkTimeout" value="10000"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>

                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>

        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="persistenceEnabled" value="true"/>
                        <property name="checkpointPageBufferSize" value="#{ 1L * 1024 * 1024 * 1024 }"/>
                        <property name="maxSize" value="#{ 28L * 1024 * 1024 * 1024 }"/>
                    </bean>
                </property>
                <property name="storagePath" value="/data-persist"/>
                <property name="walPath" value="/data-wal"/>
                <property name="walArchivePath" value="/data-wal"/>
                <property name="walMode" value="LOG_ONLY"/>
                <property name="walSegmentSize" value="#{ 128L * 1024 * 1024 }"/>
                <property name="walFlushFrequency" value="5000"/>
                <property name="maxWalArchiveSize" value="#{ 2L * 1024 * 1024 * 1024 }"/>

                <property name="writeThrottlingEnabled" value="true"/>
                <property name="checkpointFrequency" value="300000"/>
                <property name="checkpointWriteOrder" value="SEQUENTIAL"/>
            </bean>
        </property>
    </bean>
</beans>




mmuzaf

It seems to me that this is a bug in the implementation when mmap is set
to `false`. I'll try to check.

Just out of curiosity, can you clarify why the `false` value is used?
According to the comment [1], using mmap=true with the LOG_ONLY mode
shows the best performance results.

[1] https://issues.apache.org/jira/browse/IGNITE-6339?focusedCommentId=16281803&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16281803
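
For readers who want to check which value a node's JVM would see: IGNITE_WAL_MMAP is normally passed as a JVM system property (for example -DIGNITE_WAL_MMAP=false), and as noted earlier in the thread it defaults to true. The sketch below is an assumption-laden illustration, not Ignite's own lookup code: the class and method names are hypothetical, and it only inspects the JVM system property with a default of true.

```java
// Hypothetical helper; not part of Apache Ignite.
public class WalMmapFlag {

    /** Property name as used in this thread. */
    static final String WAL_MMAP_PROP = "IGNITE_WAL_MMAP";

    /**
     * Resolves the flag from JVM system properties.
     * Assumption: an unset property means the default, true.
     */
    static boolean walMmapEnabled() {
        String raw = System.getProperty(WAL_MMAP_PROP);
        return raw == null || Boolean.parseBoolean(raw);
    }

    public static void main(String[] args) {
        // Run with -DIGNITE_WAL_MMAP=false to see the non-default value.
        System.out.println(WAL_MMAP_PROP + " effective value: " + walMmapEnabled());
    }
}
```

Logging this value at node startup is one way to confirm that the setting applied in the environment is the one you intended.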

ihalilaltun

Hi mmuzaf,

Sorry for the late response. When we enabled mmap we had some IO issues, which
is why we disabled it. If there is such a bug, as you said, we can re-enable
mmap.



mmuzaf

Hello,

Dmitry is already working on the issue [1] you faced above.
The fix will probably be included in the next Apache Ignite release, 2.7.6.
According to the Confluence page [2], the release will be ready when all
the `blocker` issues are resolved.

Can you also clarify what kind of IO issues you faced with `mmap =
true`? Do you have logs or traces?

[1] https://issues.apache.org/jira/browse/IGNITE-12127
[2] https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+2.7.6

ihalilaltun

I am sorry, but we changed the configuration a long time ago and
we do not have any logs or traces :(
Is there an estimated date for the 2.7.6 release?



mmuzaf

Generally, I don't know.

The estimated date was 27 August, but some `blocker` issues are still
open. In my humble opinion, they can be fixed by the end of September.
