Production outage - Join process time out

sparkle_j

This post has NOT been accepted by the mailing list yet.
Ignite Community, would you please help us diagnose a production outage? We are on Ignite 1.5.0-final. Some of our clients are unable to connect to the grid and throw a join timeout exception. The problem is intermittent: only a few clients are affected, at irregular times, and we cannot reproduce it.

Here is our configuration :

JVM heap size 10 GB for each node. 16 nodes total.

Overnight, a few of our clients were not able to connect to the grid. Most clients were fine.
We checked JVM utilization on JMX for each node, memory was under-utilized.

Cache configuration snippet.

<property name="backups" value="1"/>
<property name="startSize" value="#{1 * 1024 * 1024}"/> 
<property name="memoryMode" value="OFFHEAP_TIERED"/>
<property name="offHeapMaxMemory" value="#{10 * 1024L * 1024L * 1024L}"/>


Have you guys seen this before? Here are our settings from the client config file and the error is below.

<property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="joinTimeout" value="30000"/>
        <property name="ackTimeout" value="30000"/>
        <property name="maxAckTimeout" value="60000"/>
        <property name="reconnectCount" value="5"/>
        <property name="ipFinder">
            <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                <property name="addresses">
                    <list>
                        <value>grid-tp1-prod:47500..47509</value>
                        <value>grid-tp2-prod:47500..47509</value>
                        <value>grid-tp3-prod:47500..47509</value>
                        <value>grid-tp4-prod:47500..47509</value>
                    </list>
                </property>
            </bean>
        </property>
    </bean>
</property>


Error Logs

2016-06-13 18:58:11,657 ERROR orderserver.client.GridClient (GridClient.java:174) - class org.apache.ignite.IgniteException: Failed to start manager: GridManagerAdapter [enabled=true, name=org.apache.ignite.internal.managers.discovery.GridDiscoveryManager]
class org.apache.ignite.IgniteException: Failed to start manager: GridManagerAdapter [enabled=true, name=org.apache.ignite.internal.managers.discovery.GridDiscoveryManager]
        at org.apache.ignite.internal.util.IgniteUtils.convertException(IgniteUtils.java:906)
        at org.apache.ignite.Ignition.start(Ignition.java:350)
        at com.tudor.datagridI.TradingDataAccessImpl.<init>(TradingDataAccessImpl.java:104)
        at com.tudor.datagridI.DataGridClient.getTradingDataAccess(DataGridClient.java:16)
        at orderserver.client.GridClient.getTradingDataAccess(GridClient.java:94)
        at orderserver.client.GridClient.updateOrderInGrid(GridClient.java:164)
        at orderserver.OrderFactory.saveOrders(OrderFactory.java:5683)
        at com.tudor.fix.processor.SaveOrders.saveOrders(SaveOrders.java:124)
        at com.tudor.fix.processor.SaveOrders.saveOrders(SaveOrders.java:94)
        at com.tudor.fix.processor.SaveOrders.transform(SaveOrders.java:38)
        at com.tudor.fix.transformer.CompositeFilteringFixStateTransformer.transform(CompositeFilteringFixStateTransformer.java:59)
        at com.tudor.fix.transformer.ReportingBatchTransformer.transform(ReportingBatchTransformer.java:74)
        at com.tudor.fix.transformer.BatchingFixStateTransformer.batch(BatchingFixStateTransformer.java:158)
        at com.tudor.fix.transformer.BatchingFixStateTransformer.transform(BatchingFixStateTransformer.java:104)
        at com.tudor.fix.service.MessageKeeper.run(MessageKeeper.java:107)
        at java.lang.Thread.run(Thread.java:745)
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to start manager: GridManagerAdapter [enabled=true, name=org.apache.ignite.internal.managers.discovery.GridDiscoveryManager]
        at org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1536)
        at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:897)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:1736)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1589)
        at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1042)
        at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:964)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:850)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:749)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:619)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:589)
        at org.apache.ignite.Ignition.start(Ignition.java:347)
        ... 14 more
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to start SPI: TcpDiscoverySpi [addrRslvr=null, sockTimeout=5000, ackTimeout=30000, reconCnt=5, maxAckTimeout=60000, forceSrvMode=false, clientReconnectDisabled=false]
        at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:258)
        at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:677)
        at org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1531)
        ... 24 more
Caused by: class org.apache.ignite.spi.IgniteSpiException: Join process timed out, did not receive response for join request (consider increasing 'joinTimeout' configuration property) [joinTimeout=30000, sock=Socket[addr=grid-tp2-prod/10.22.50.41,port=47503,localport=38191]]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1335)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)


Appreciate your help.

Thanks,
Sparkle.
Denis Magda
Re: Production outage - Join process time out

Hi,

Please properly subscribe to the user list if you want to get answers from the community sooner (this way we will not have to manually approve your emails). All you need to do is send an email to "user-subscribe@ignite.apache.org" and follow the simple instructions in the reply.


Is there any particular reason why you set the following low-level settings for TcpDiscoverySpi?
                <property name="joinTimeout" value="30000"/>
                <property name="ackTimeout" value="30000"/>
                <property name="maxAckTimeout" value="60000"/>           
                <property name="reconnectCount" value="5"/>

If you observe high latencies in your network then you need to increase 'socketWriteTimeout' as well. The following can also cause the issue:
- long GC pauses on the server or client side. Check the GC logs - https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats;
- not enough network throughput during some periods.

I would suggest removing all these low-level settings from TcpDiscoverySpi and setting IgniteConfiguration.failureDetectionTimeout instead (preferably on all the nodes). Low-level tuning of TcpDiscoverySpi is needed only in rare cases.
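
As a rough sketch of that suggestion, the client config could look something like the fragment below. The surrounding IgniteConfiguration bean is an assumption, since the original post shows only the discoverySpi property, and the 30000 ms value is a placeholder rather than a recommendation. To my understanding, failureDetectionTimeout is ignored whenever ackTimeout, maxAckTimeout, socketTimeout or reconnectCount are set explicitly on the SPI, which is why they are left out here.

```xml
<!-- Assumed enclosing bean; only the discoverySpi property appears in the original post. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Placeholder value in milliseconds; tune it to the GC pauses and network latency you observe. -->
    <property name="failureDetectionTimeout" value="30000"/>

    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <!-- ackTimeout, maxAckTimeout and reconnectCount are intentionally not set. -->
            <property name="ipFinder">
                <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                    <property name="addresses">
                        <list>
                            <value>grid-tp1-prod:47500..47509</value>
                            <value>grid-tp2-prod:47500..47509</value>
                            <value>grid-tp3-prod:47500..47509</value>
                            <value>grid-tp4-prod:47500..47509</value>
                        </list>
                    </property>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```

The same failureDetectionTimeout value would go into the server node configs as well, so that all nodes apply a consistent timeout.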

--
Denis
 