Starvation in striped pool

maheshkr76private

Starvation in striped pool

hi,

We are seeing a crash of the server node. This is with Ignite 2.7.6
Any insight and a possible workaround would be deeply appreciated.
Attached is the log: ignite-29268843.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1596/ignite-29268843.zip>  


[11:16:09,531][WARNING][grid-timeout-worker-#27][G] >>> Possible starvation
in striped pool.
    Thread name: sys-stripe-4-#5
    Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE,
topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false,
msg=GridNearSingleGetRequest [futId=1570427398413, key=KeyCacheObjectImpl
[part=212, val=null, hasValBytes=true], flags=1,
topVer=AffinityTopologyVersion [topVer=618, minorTopVer=4],
subjId=b3a51345-c6a5-4a6f-9117-374a80948ae8, taskNameHash=0, createTtl=-1,
accessTtl=-1, mvccSnapshot=null]]], Message closure [msg=GridIoMessage
[plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0,
skipOnTimeout=false, msg=GridNearSingleGetRequest [futId=1570427398737,
key=KeyCacheObjectImpl [part=212, val=null, hasValBytes=true], flags=1,
topVer=AffinityTopologyVersion [topVer=618, minorTopVer=4],
subjId=b3a51345-c6a5-4a6f-9117-374a80948ae8, taskNameHash=0, createTtl=-1,
accessTtl=-1, mvccSnapshot=null]]]]
    Deadlock: false
    Completed: 10057



ilya.kasnacheev

Re: Starvation in striped pool

Hello!

Thread [name="sys-stripe-4-#5", id=27, state=RUNNABLE, blockCnt=0, waitCnt=10004]
        at sun.nio.ch.Net.poll(Native Method)
        at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
        at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3299)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
        at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1656)

Most likely this means that there is no communication connectivity between your nodes, e.g., your server node can't open a connection to your client node to send back the results of an operation in the striped pool.

Please check your network.
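
For example, you could check from the server host that the client's communication port is reachable with a quick probe like the one below (a rough sketch; the host and port are placeholders for your client's address and its communication port, 47100 by default):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class CommPortCheck {
        public static void main(String[] args) {
            // Placeholder address: the client node's host and its communication port.
            String host = "192.168.1.22";
            int port = 47100;

            try (Socket sock = new Socket()) {
                // Fail fast instead of hanging the way the striped-pool thread did.
                sock.connect(new InetSocketAddress(host, port), 5_000);
                System.out.println("Reachable: " + host + ":" + port);
            }
            catch (IOException e) {
                System.out.println("Cannot reach " + host + ":" + port + ": " + e.getMessage());
            }
        }
    }

The same check should also pass in the opposite direction, since communication connections may be opened by either side.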

Regards,
--
Ilya Kasnacheev


maheshkr76private

Re: Starvation in striped pool

OK. There could have been a temporary network issue between the server and
client node.

However, I was expecting the server node to throw the client out of the
cluster and resume normal functioning. But what bothers me is that the
server node never recovered after the network issue and finally got stuck in
dumping threads. If there was a bigger network issue, I would have expected
the server to report segmentation and go to the failure handler. This did
not happen.

It feels like this scenario is not handled properly in the Ignite code.




ilya.kasnacheev

Re: Starvation in striped pool

Hello!

If the client node continues to respond via Discovery, the server node is not going to drop it for unreachability. This is the default behavior.

Regards,
--
Ilya Kasnacheev


maheshkr76private

Re: Starvation in striped pool

Hello Ilya,
Once the connection goes bad between client and server, which configuration
parameter on the thick-client side would force the thick client to stop
pinging the server? I tried the join timeout and connection timeouts in the
communication and discovery SPIs (roughly as in the sketch below), and
nothing ever worked.
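
Roughly, what was tried was along these lines (a sketch from memory; the values are illustrative rather than our exact production settings, and the IP finder is omitted for brevity):

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

    public class ClientTimeoutsSketch {
        static IgniteConfiguration clientConfig() {
            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setClientMode(true);

            TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
            discoSpi.setJoinTimeout(60_000);      // give up joining after 60 s
            discoSpi.setNetworkTimeout(10_000);   // discovery network operation timeout

            TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
            commSpi.setConnectTimeout(5_000);     // initial connect timeout
            commSpi.setMaxConnectTimeout(30_000); // upper bound for handshake back-off
            commSpi.setReconnectCount(5);         // limit reconnection attempts

            cfg.setDiscoverySpi(discoSpi);
            cfg.setCommunicationSpi(commSpi);
            return cfg;
        }
    }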


I have seen, in a few scenarios, that a thick client never stops pinging the
server if the network connection goes bad for any reason. We continuously see
these pings on the server, and the entire cluster slows down.




[06:20:16,203][INFO][grid-nio-worker-tcp-comm-7-#47][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
rmtAddr=/192.168.1.22:32935]
[06:20:16,203][INFO][grid-nio-worker-tcp-comm-7-#47][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=527c8173-b079-459a-8bb5-260f62d47797, locNodeOrder=1,
rmtNode=a89f3605-fe0b-4abf-8d98-245d75205743, rmtNodeOrder=587]
[06:20:16,405][INFO][grid-nio-worker-tcp-comm-0-#40][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
rmtAddr=/192.168.1.22:38152]
[06:20:16,405][INFO][grid-nio-worker-tcp-comm-0-#40][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=527c8173-b079-459a-8bb5-260f62d47797, locNodeOrder=1,
rmtNode=a89f3605-fe0b-4abf-8d98-245d75205743, rmtNodeOrder=587]
[06:20:16,607][INFO][grid-nio-worker-tcp-comm-1-#41][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
rmtAddr=/192.168.1.22:49095]
[06:20:16,607][INFO][grid-nio-worker-tcp-comm-1-#41][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=527c8173-b079-459a-8bb5-260f62d47797, locNodeOrder=1,
rmtNode=a89f3605-fe0b-4abf-8d98-245d75205743, rmtNodeOrder=587]



This issue was reported here.
http://apache-ignite-users.70518.x6.nabble.com/Ignite-2-7-0-thick-client-has-all-system-threads-busy-indefinitely-td28880.html#a29018








ilya.kasnacheev

Re: Starvation in striped pool

Hello!

It's hard to say what is happening here. What timeout settings do you have? Can you provide a complete log from the client node as well?

Regards,
--
Ilya Kasnacheev


maheshkr76private

Re: Starvation in striped pool

Attached are the logs. In the server log, you will see the thick client
continuously pinging the server indefinitely; there is no recovery of the
thick client. So the problem is that we can't even reboot the thick client in
a production scenario, as it does not even fail (meaning the configured
failure handler is never triggered).
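
For reference, the failure handler is wired in roughly like this (a sketch; our real handler is our own lambda, shown here replaced with Ignite's stock StopNodeOrHaltFailureHandler, and the rest of the configuration is omitted):

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

    public class FailureHandlerSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setClientMode(true);

            // Expected to fire on unrecoverable failures (it never did in this scenario).
            cfg.setFailureHandler(new StopNodeOrHaltFailureHandler());

            Ignition.start(cfg);
        }
    }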

thickclient-prining-always.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1596/thickclient-prining-always.zip>  


Below is the thick client configuration that we use

Ignite thick client configuration...
IgniteConfiguration [igniteInstanceName=null, pubPoolSize=8,
svcPoolSize=null, callbackPoolSize=8, stripedPoolSize=8, sysPoolSize=8,
mgmtPoolSize=4, igfsPoolSize=1, dataStreamerPoolSize=8,
utilityCachePoolSize=8, utilityCacheKeepAliveTime=60000, p2pPoolSize=2,
qryPoolSize=8, igniteHome=null, igniteWorkDir=null, mbeanSrv=null,
nodeId=null, marsh=null, marshLocJobs=false, daemon=false, p2pEnabled=true,
netTimeout=60000, sndRetryDelay=1000, sndRetryCnt=3, metricsHistSize=100,
metricsUpdateFreq=2000, metricsExpTime=9223372036854775807,
discoSpi=TcpDiscoverySpi [addrRslvr=null, sockTimeout=0, ackTimeout=0,
marsh=null, reconCnt=10, reconDelay=2000, maxAckTimeout=600000,
forceSrvMode=false, clientReconnectDisabled=true, internalLsnr=null],
segPlc=STOP, segResolveAttempts=2, waitForSegOnStart=true,
allResolversPassReq=true, segChkFreq=10000, commSpi=TcpCommunicationSpi
[connectGate=null,
connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@74e47d7c,
enableForcibleNodeKill=false, enableTroubleshootingLog=false, locAddr=null,
locHost=null, locPort=47100, locPortRange=100, shmemPort=-1, directBuf=true,
directSndBuf=true, idleConnTimeout=600000, connTimeout=5000,
maxConnTimeout=48000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768,
msgQueueLimit=1000, slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null,
usePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true,
filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0,
sockWriteTimeout=2000, boundTcpPort=-1, boundTcpShmemPort=-1,
selectorsCnt=4, selectorSpins=0, addrRslvr=null,
ctxInitLatch=java.util.concurrent.CountDownLatch@65072e7d[Count = 1],
stopping=false], evtSpi=null, colSpi=null, deploySpi=null, indexingSpi=null,
addrRslvr=null, encryptionSpi=null, clientMode=true,
rebalanceThreadPoolSize=1, txCfg=TransactionConfiguration
[txSerEnabled=false, dfltIsolation=REPEATABLE_READ,
dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0,
txTimeoutOnPartitionMapExchange=0, pessimisticTxLogSize=0,
pessimisticTxLogLinger=10000, tmLookupClsName=null, txManagerFactory=null,
useJtaSync=false], cacheSanityCheckEnabled=true, discoStartupDelay=60000,
deployMode=SHARED, p2pMissedCacheSize=100, locHost=null,
timeSrvPortBase=31100, timeSrvPortRange=100, failureDetectionTimeout=10000,
sysWorkerBlockedTimeout=120000, clientFailureDetectionTimeout=120000,
metricsLogFreq=3000, hadoopCfg=null, connectorCfg=ConnectorConfiguration
[jettyPath=null, host=null, port=11211, noDelay=true, directBuf=false,
sndBufSize=32768, rcvBufSize=32768, idleQryCurTimeout=600000,
idleQryCurCheckFreq=60000, sndQueueLimit=0, selectorCnt=1, idleTimeout=7000,
sslEnabled=false, sslClientAuth=false, sslCtxFactory=null, sslFactory=null,
portRange=100, threadPoolSize=8, msgInterceptor=null], odbcCfg=null,
warmupClos=null, atomicCfg=AtomicConfiguration [seqReserveSize=1000,
cacheMode=PARTITIONED, backups=1, aff=null, grpName=null], classLdr=null,
sslCtxFactory=null, platformCfg=null, binaryCfg=null, memCfg=null,
pstCfg=null, dsCfg=null, activeOnStart=true, autoActivation=true,
longQryWarnTimeout=3000, sqlConnCfg=null,
cliConnCfg=ClientConnectorConfiguration [host=null, port=10800,
portRange=100, sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true,
maxOpenCursorsPerConn=128, threadPoolSize=8, idleTimeout=0,
jdbcEnabled=true, odbcEnabled=true, thinCliEnabled=true, sslEnabled=false,
useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null],
mvccVacuumThreadCnt=2, mvccVacuumFreq=5000, authEnabled=false,
failureHnd=net.aline.cloudedh.base.database.IgniteDatabase$$Lambda$19/1447342142@5fafe841,
commFailureRslvr=null]





maheshkr76private

Re: Starvation in striped pool

Ilya.

What is most mysterious to me is that I disabled reconnect on the thick
client (clientReconnectDisabled=true). Still, the server prints the below,
where the same thick client is making an immediate attempt to reconnect back
to the cluster while the previous connection still isn't successful.

[00:47:46,004][INFO][grid-nio-worker-tcp-comm-1-#29][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
rmtAddr=/192.168.1.71:2007]
[00:47:46,004][INFO][grid-nio-worker-tcp-comm-1-#29][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=3240558e-72b4-4314-a970-0965654e7e6f, locNodeOrder=1,
rmtNode=e1be7b6f-8691-4f81-a03f-c77ff14843ef, rmtNodeOrder=675]

This, I feel, is internal Ignite code logic which does not seem to be
influenced by the user's IgniteConfiguration parameters. Can you cross-check
this? My point is that the node should not attempt a reconnect when the user
has set clientReconnectDisabled=true in IgniteConfiguration.



ilya.kasnacheev

Re: Starvation in striped pool

Hello!

I'm almost certain that the problem is that the server node cannot open a connection to the client node (and while it tries, it will reject connection attempts from the client node).

clientReconnectDisabled=true only concerns discovery. In your case, there are no problems with discovery; the problem is with communication (port 47100). Unless discovery fails, the node will not be dropped.
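
To make the distinction concrete, the two settings live on different SPIs, roughly like this (a sketch, not your exact configuration; the communication values shown are the defaults from your own dump):

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

    public class SpiSplitSketch {
        static IgniteConfiguration clientConfig() {
            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setClientMode(true);

            // Discovery (default port range 47500+): clientReconnectDisabled only
            // controls whether the client rejoins after a discovery disconnect.
            TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
            discoSpi.setClientReconnectDisabled(true);

            // Communication (default port 47100): connection attempts for cache
            // requests and responses are retried here, independently of discovery.
            TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
            commSpi.setConnectTimeout(5_000);
            commSpi.setMaxConnectTimeout(48_000);
            commSpi.setReconnectCount(10);

            cfg.setDiscoverySpi(discoSpi);
            cfg.setCommunicationSpi(commSpi);
            return cfg;
        }
    }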

Regards,
--
Ilya Kasnacheev


maheshkr76private

Re: Starvation in striped pool

>> I'm almost certain that the problem is that the server node cannot open a
connection to the client node (and while it tries, it will reject connection
attempts from the client node).

The default idleConnectionTimeout of the TCP communication SPI is 10 minutes
(idleConnTimeout=600000 in the configuration above). So I assume that after
this timeout the connection is closed and probably re-established later on a
request from the client. So I can imagine this issue happening while the
client and server are trying to re-establish the connection, and your
explanation makes sense.

However, my concern still remains. The server has plenty of timeouts in its
communication SPI. Why doesn't the server throw that client out of the
cluster and let the client fail gracefully? This incessant pinging of the
server by the client is a problem in production environments.

Currently, the only way out for me seems to be to set
    communicationSpi.setIdleConnectionTimeout(Long.MAX_VALUE);
Testing is currently underway and it has been holding up for 22 hours now.

Do you see any issues with setting idleConnectionTimeout that high?
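
For completeness, this is roughly how it is being applied on the client (a sketch; the class name is just a placeholder and the rest of our configuration is omitted):

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

    public class IdleTimeoutWorkaroundSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setClientMode(true);

            TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
            // Workaround under test: never close idle communication connections,
            // so the fragile reconnect path is never exercised.
            commSpi.setIdleConnectionTimeout(Long.MAX_VALUE);
            cfg.setCommunicationSpi(commSpi);

            Ignition.start(cfg);
        }
    }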



 




ilya.kasnacheev

Re: Starvation in striped pool

Hello!

I don't think there is any problem with idleConnectionTimeout, but you should not use nodes that are not mutually connectible anyway.

I can't really comment on the feasibility of dropping a client when it can't be reached via communication. You can start a discussion about it on the developers list.

Regards,
--
Ilya Kasnacheev


ihalilaltun

Re: Starvation in striped pool

Hi Ilya,

From time to time, we have faced exactly the same problem. Are there any
best practices for handling network issues? What I mean is, if there are any
network issues between client(s) and server(s), we want the cluster to keep
living. As for the clients, they can be disconnected from the servers.

Regards.



-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
Mahesh Renduchintala

Re: Starvation in striped pool

Currently, setting idleConnectionTimeout = Long.MAX_VALUE in the TCP communication SPI has solved the problem for us.


ilya.kasnacheev

Re: Starvation in striped pool

In reply to this post by ihalilaltun
Hello!

Well, we had IGNITE_ENABLE_FORCIBLE_NODE_KILL, but the best solution, in my opinion, is to avoid adding anything unstable to the cluster.
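
If you do want to experiment with it, it is a JVM system property, enabled roughly like this (a sketch; the Spring XML path is a placeholder, and behavior may differ between versions):

    import org.apache.ignite.IgniteSystemProperties;
    import org.apache.ignite.Ignition;

    public class ForcibleNodeKillSketch {
        public static void main(String[] args) {
            // Equivalent to passing -DIGNITE_ENABLE_FORCIBLE_NODE_KILL=true to the JVM.
            // Must be set before the node starts.
            System.setProperty(IgniteSystemProperties.IGNITE_ENABLE_FORCIBLE_NODE_KILL, "true");

            Ignition.start("path/to/server-config.xml");
        }
    }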

Regards,
--
Ilya Kasnacheev

