Local node SEGMENTED error causing nodes to go down for no obvious reason


I'm running a six-node Ignite 2.6 cluster.
The config for each server is as follows:

    <bean id="grid.cfg"
class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="segmentationPolicy" value="RESTART_JVM"/>
        <property name="peerClassLoadingEnabled" value="true"/>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="dataStorageConfiguration">
            <bean
class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="storagePath" value="/data/dc1/ignite"/>
            <property name="walPath" value="/data/da1"/>
            <property name="walArchivePath" value="/data/da1/archive"/>
            <property name="defaultDataRegionConfiguration">
                <bean
class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="default_Region"/>
                    <property name="initialSize" value="#{100L * 1024 * 1024
* 1024}"/>
                    <property name="maxSize" value="#{300L * 1024 * 1024 *
1024}"/>
                    <property name="persistenceEnabled" value="true"/>
                    <property name="checkpointPageBufferSize" value="#{8L *
1024 * 1024 * 1024}"/>
                </bean>
            </property>
            <property name="walMode" value="BACKGROUND"/>
            <property name="walFlushFrequency" value="5000"/>
            <property name="checkpointFrequency" value="600000"/>
            </bean>
        </property>
        <property name="discoverySpi">
                <bean
class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                    <property name="networkTimeout" value="60000" />
                    <property name="localPort" value="49500"/>
                    <property name="ipFinder">
                        <bean
class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                            <property name="addresses">
                                <list>
                                <value>10.29.42.231:49500</value>
                                <value>10.29.42.233:49500</value>
                                <value>10.29.42.234:49500</value>
                                <value>10.29.42.235:49500</value>
                                <value>10.29.42.236:49500</value>
                                <value>10.29.42.232:49500</value>
                                </list>
                            </property>
                        </bean>
                    </property>
                </bean>
            </property>
            <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <constructor-arg type="java.lang.String"
value="config/ignite-log4j2.xml"/>
            </bean>
        </property>
    </bean>
</beans>
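
In case it helps, here is a rough programmatic equivalent of the XML above. It is
just a sketch that mirrors the Spring config (same sizes, paths and addresses);
the Log4j2 logger bean is left out:

    import java.util.Arrays;

    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.configuration.WALMode;
    import org.apache.ignite.plugin.segmentation.SegmentationPolicy;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

    public class ServerConfigSketch {
        public static IgniteConfiguration serverConfig() {
            // Persistent default region, same sizes as in the XML.
            DataRegionConfiguration region = new DataRegionConfiguration()
                .setName("default_Region")
                .setInitialSize(100L * 1024 * 1024 * 1024)              // 100 GB
                .setMaxSize(300L * 1024 * 1024 * 1024)                  // 300 GB
                .setPersistenceEnabled(true)
                .setCheckpointPageBufferSize(8L * 1024 * 1024 * 1024);  // 8 GB

            DataStorageConfiguration storage = new DataStorageConfiguration()
                .setStoragePath("/data/dc1/ignite")
                .setWalPath("/data/da1")
                .setWalArchivePath("/data/da1/archive")
                .setDefaultDataRegionConfiguration(region)
                .setWalMode(WALMode.BACKGROUND)
                .setWalFlushFrequency(5_000L)
                .setCheckpointFrequency(600_000L);

            // Static IP finder with all six servers on discovery port 49500.
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Arrays.asList(
                "10.29.42.231:49500", "10.29.42.233:49500", "10.29.42.234:49500",
                "10.29.42.235:49500", "10.29.42.236:49500", "10.29.42.232:49500"));

            TcpDiscoverySpi disco = new TcpDiscoverySpi()
                .setNetworkTimeout(60_000L)
                .setLocalPort(49500);
            disco.setIpFinder(ipFinder);

            return new IgniteConfiguration()
                .setSegmentationPolicy(SegmentationPolicy.RESTART_JVM)
                .setPeerClassLoadingEnabled(true)
                .setFailureDetectionTimeout(60_000L)
                .setDataStorageConfiguration(storage)
                .setDiscoverySpi(disco);
        }
    }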

I also enabled the Direct IO plugin.

When I try to ingest data into Ignite using the Spark DataFrame API, the cluster
becomes very slow after the Spark driver connects, and some of the server nodes
eventually go down with the error shown below, after the ingest sketch.
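
For reference, the ingest goes through the standard ignite-spark DataFrame data
source, roughly like the sketch below. The input path, table name and key column
are placeholders, not the real job:

    import org.apache.ignite.spark.IgniteDataFrameSettings;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class IngestSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("ignite-ingest").getOrCreate();

            // Placeholder input; the real job reads a much larger dataset.
            Dataset<Row> df = spark.read().parquet("hdfs:///some/input");

            df.write()
                .format(IgniteDataFrameSettings.FORMAT_IGNITE())
                // Spring XML for the thick client started inside the driver (placeholder path).
                .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), "config/client.xml")
                .option(IgniteDataFrameSettings.OPTION_TABLE(), "my_table") // placeholder table
                .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id")
                .mode(SaveMode.Append)
                .save();

            spark.stop();
        }
    }

This is the error on the failing server nodes: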

Local node SEGMENTED: TcpDiscoveryNode [id=8ce23742-702e-4309-934a-affd80bf3653, addrs=[10.29.42.232, 127.0.0.1], sockAddrs=[/10.29.42.232:49500, /127.0.0.1:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1541571124026, loc=true, ver=2.6.0#20180709-sha1:5faffcee, isClient=false]
[2018-11-07T06:12:04,032][INFO ][disco-pool-#457][TcpDiscoverySpi] Finished node ping [nodeId=844fab1e-4189-4f10-bc84-b069bc18a267, res=true, time=6ms]
[2018-11-07T06:12:04,033][ERROR][tcp-disco-srvr-#2][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5687) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) [ignite-core-2.6.0.jar:2.6.0]
[2018-11-07T06:12:04,036][ERROR][tcp-disco-srvr-#2][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]
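
Side note: the StopNodeOrHaltFailureHandler named in the log is Ignite's default
failure handler. As far as I know it can also be set explicitly on the
IgniteConfiguration; a minimal sketch (the timeout value is only an example):

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

    public class FailureHandlerSketch {
        static IgniteConfiguration withExplicitHandler(IgniteConfiguration cfg) {
            // tryStop = true: try a graceful node stop first; halt the JVM if the stop
            // does not finish within the timeout (ms), which is what the log above shows.
            return cfg.setFailureHandler(new StopNodeOrHaltFailureHandler(true, 60_000L));
        }
    }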

I examined the GC logs, and none of the nodes show long GC pauses.
The network connectivity between all these nodes is fine.

The complete logs for all six servers and the client are in the attachment.


From my observation, the PME (partition map exchange) process when a new thick
client from the Spark DataFrame API joins the topology is very slow and can lead
to many problems.
I think the proposal suggested by Nikolay to move from thick clients to Java thin
clients is a good way to improve this.
http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Spark-Data-Frame-through-Thin-Client-td36814.html
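
For example, a Java thin client only opens a socket to the cluster and never joins
the discovery topology, so connecting it does not trigger PME at all. A minimal
sketch (the addresses are my servers on the default thin client port 10800, and
the table name is a placeholder):

    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.query.SqlFieldsQuery;
    import org.apache.ignite.client.IgniteClient;
    import org.apache.ignite.configuration.ClientConfiguration;

    public class ThinClientSketch {
        public static void main(String[] args) throws Exception {
            ClientConfiguration clientCfg = new ClientConfiguration()
                .setAddresses("10.29.42.231:10800", "10.29.42.232:10800");

            // startClient() only opens a socket; the client never becomes part of the
            // discovery topology, so no partition map exchange is triggered.
            try (IgniteClient client = Ignition.startClient(clientCfg)) {
                client.query(new SqlFieldsQuery("SELECT COUNT(*) FROM my_table")).getAll();
            }
        }
    }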

iglog.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/iglog.zip>  


