Local node SEGMENTED error causing node to go down for no obvious reason

Ray
Local node SEGMENTED error causing node to go down for no obvious reason

I'm running a six-node Ignite 2.6 cluster.
The config for each server is as follows:

    <bean id="grid.cfg"
class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="segmentationPolicy" value="RESTART_JVM"/>
        <property name="peerClassLoadingEnabled" value="true"/>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="dataStorageConfiguration">
            <bean
class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="storagePath" value="/data/dc1/ignite"/>
            <property name="walPath" value="/data/da1"/>
            <property name="walArchivePath" value="/data/da1/archive"/>
            <property name="defaultDataRegionConfiguration">
                <bean
class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="default_Region"/>
                    <property name="initialSize" value="#{100L * 1024 * 1024
* 1024}"/>
                    <property name="maxSize" value="#{300L * 1024 * 1024 *
1024}"/>
                    <property name="persistenceEnabled" value="true"/>
                    <property name="checkpointPageBufferSize" value="#{8L *
1024 * 1024 * 1024}"/>
                </bean>
            </property>
            <property name="walMode" value="BACKGROUND"/>
            <property name="walFlushFrequency" value="5000"/>
            <property name="checkpointFrequency" value="600000"/>
            </bean>
        </property>
        <property name="discoverySpi">
                <bean
class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                    <property name="networkTimeout" value="60000" />
                    <property name="localPort" value="49500"/>
                    <property name="ipFinder">
                        <bean
class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                            <property name="addresses">
                                <list>
                                <value>10.29.42.231:49500</value>
                                <value>10.29.42.233:49500</value>
                                <value>10.29.42.234:49500</value>
                                <value>10.29.42.235:49500</value>
                                <value>10.29.42.236:49500</value>
                                <value>10.29.42.232:49500</value>
                                </list>
                            </property>
                        </bean>
                    </property>
                </bean>
            </property>
            <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <constructor-arg type="java.lang.String"
value="config/ignite-log4j2.xml"/>
            </bean>
        </property>
    </bean>
</beans>

I also enabled the Direct IO plugin.

When I ingest data into Ignite using the Spark DataFrame API, the cluster
becomes very slow after the Spark driver connects to it, and some of the
server nodes eventually go down with this error (a sketch of the ingestion
job follows the log below):

Local node SEGMENTED: TcpDiscoveryNode [id=8ce23742-702e-4309-934a-affd80bf3653, addrs=[10.29.42.232, 127.0.0.1], sockAddrs=[/10.29.42.232:49500, /127.0.0.1:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1541571124026, loc=true, ver=2.6.0#20180709-sha1:5faffcee, isClient=false]
[2018-11-07T06:12:04,032][INFO ][disco-pool-#457][TcpDiscoverySpi] Finished node ping [nodeId=844fab1e-4189-4f10-bc84-b069bc18a267, res=true, time=6ms]
[2018-11-07T06:12:04,033][ERROR][tcp-disco-srvr-#2][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5687) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) [ignite-core-2.6.0.jar:2.6.0]
[2018-11-07T06:12:04,036][ERROR][tcp-disco-srvr-#2][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]
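
For reference, the ingestion job looks roughly like this (a sketch; the source
path, table name, primary-key field, and client config path are placeholders,
not my real values):

    import org.apache.ignite.spark.IgniteDataFrameSettings._
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("ignite-ingest").getOrCreate()

    // Source data to ingest (placeholder path).
    val df = spark.read.parquet("/data/input")

    df.write
      .format(FORMAT_IGNITE)                            // "ignite" data source from the ignite-spark module
      .option(OPTION_CONFIG_FILE, "config/client.xml")  // thick-client config used by the data source (placeholder)
      .option(OPTION_TABLE, "my_table")                 // target SQL table (placeholder)
      .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "id")
      .mode(SaveMode.Append)
      .save()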

I examined the GC logs, and none of the nodes have long GC pauses.
Network connectivity between all these nodes is fine.

The complete logs for all six servers and the client are attached.


From my observation, the partition map exchange (PME) triggered when a new
thick client from the Spark DataFrame API joins the topology is very slow and
can lead to many problems.
I think the proposal suggested by Nikolay to switch thick clients to Java
thin clients is a good way to improve this:
http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Spark-Data-Frame-through-Thin-Client-td36814.html
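
For reference, a thin client connects over a plain socket and never joins the
discovery ring, so it cannot trigger PME on join. A minimal sketch of the thin
client API that is already available (the addresses assume the default client
connector port 10800; the cache name is hypothetical):

    import org.apache.ignite.Ignition
    import org.apache.ignite.configuration.ClientConfiguration

    val cfg = new ClientConfiguration()
      .setAddresses("10.29.42.231:10800", "10.29.42.232:10800") // default thin-client port
    val client = Ignition.startClient(cfg)
    try {
      // Cache operations go over the single socket; no topology join, no PME.
      val cache = client.getOrCreateCache[java.lang.Long, String]("sketchCache") // hypothetical cache
      cache.put(1L, "hello")
    } finally {
      client.close()
    }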

iglog.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/iglog.zip>  



ilya.kasnacheev

Re: Local node SEGMENTED error causing node to go down for no obvious reason

Hello!

I can see that at 2018-11-07T07:54:47 five new clients suddenly arrive, and
then five server nodes drop (perhaps because they were not able to acknowledge
the addition of the new nodes?).

Either there are communication problems between the new clients and the old
servers, or some bug is manifested by the rapid inclusion of multiple clients.

Regards,



yakov

Re: Local node SEGMENTED error causing node to go down for no obvious reason

It seems there were issues with the network. You can see plenty of discovery warnings of this kind in the logs (231.log):

[2018-11-07T07:44:44,627][WARN ][grid-timeout-worker-#119][TcpDiscoverySpi] Socket write has timed out (consider increasing 'IgniteConfiguration.failureDetectionTimeout' configuration property) [failureDetectionTimeout=60000, rmtAddr=/10.29.42.232:49500, rmtPort=49500, sockTimeout=5000]
[2018-11-07T07:44:44,630][WARN ][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Failed to send message to next node [msg=TcpDiscoveryClientReconnectMessage [routerNodeId=9a4ee928-a71d-484b-88cc-2ded8efb7b1d, lastMsgId=null, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=005b10de661-88aac721-2b85-432b-b703-ca6aff5252c6, verifierNodeId=a7685ff7-78b6-442c-a819-f8a5b2547623, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=e940c0d9-15f7-46ab-be95-ee2302ccc8f4, addrs=[10.29.42.232], sockAddrs=[/10.29.42.232:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1541574587361, loc=false, ver=2.6.0#20180709-sha1:5faffcee, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryClientReconnectMessage [routerNodeId=9a4ee928-a71d-484b-88cc-2ded8efb7b1d, lastMsgId=null, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=005b10de661-88aac721-2b85-432b-b703-ca6aff5252c6, verifierNodeId=a7685ff7-78b6-442c-a819-f8a5b2547623, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=e940c0d9-15f7-46ab-be95-ee2302ccc8f4, order=2, addr=[10.29.42.232], daemon=false]]]
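
If tuning timeouts, the discovery socket timeout can also be set explicitly on the SPI, e.g. (the value here is only an illustration, not a recommendation):

    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <!-- Explicit per-write socket timeout; the warning above shows 5000 ms. -->
            <property name="socketTimeout" value="20000"/>
            <!-- ... existing localPort, ipFinder, etc. ... -->
        </bean>
    </property>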

The node with order=1 (the 231 one) kicked the other server nodes out of the topology, for some reason we need to figure out.

What environment do you run Ignite on?

One more strange thing I see in the logs is this:

[2018-11-07T07:24:55,542][INFO ][tcp-disco-sock-reader-#146][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.231:28977, rmtPort=28977]
[2018-11-07T07:24:55,547][INFO ][exchange-worker-#162][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=20, minorTopVer=0], crd=true, evt=NODE_JOINED, evtNode=23c738f5-fbbf-44dc-a5fb-5d09933d9c4b, customEvt=null, allowMerge=true]

But I do not see an "Added new node" event log entry for the node with this ID. The message should look like this:
[2018-11-07T07:20:56,050][INFO ][disco-event-worker-#161][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=a4acc241-dfa6-44bd-a62b-ebce2f68d199, addrs=[10.29.42.49, 127.0.0.1], sockAddrs=[/10.29.42.49:0, /127.0.0.1:0], discPort=0, order=14, intOrder=13, lastExchangeTime=1541575169947, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=true]

Ilya, can you please take a look at the logs one more time?

Ray, can you please reproduce the issue with DEBUG turned on for discovery? Please also fix all the warnings of this kind:
[2018-11-07T07:11:37,701][WARN ][disco-event-worker-#161][GridDiscoveryManager] Local node's value of 'java.net.preferIPv4Stack' system property differs from remote node's (all nodes in topology should have identical value) [locPreferIpV4=null, rmtPreferIpV4=null, locId8=e940c0d9, rmtId8=7e10c6c4, rmtAddrs=[sap-zookeeper3/10.29.42.43, /127.0.0.1], rmtNode=ClusterNode [id=7e10c6c4-3137-4640-a673-71f30d66d0e3, order=8, addr=[10.29.42.43, 127.0.0.1], daemon=false]]
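
For the discovery DEBUG part, assuming the stock config/ignite-log4j2.xml layout, adding a logger like this to its <Loggers> section should be enough:

    <Logger name="org.apache.ignite.spi.discovery" level="DEBUG"/>

And the preferIPv4Stack warning is usually fixed by passing the same JVM flag to every node and client, e.g.:

    -Djava.net.preferIPv4Stack=true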

--Yakov