What is considered high IO wait and partition exchange fiallure?

classic Classic list List threaded Threaded
2 messages Options
javadevmtl javadevmtl
Reply | Threaded
Open this post in threaded view
|

What is considered high IO wait and partition exchange fiallure?

So if I understand correctly the logs below...

The node that shut off was timing out trying to get partition exchange from the indicated nodes and it shut itself off correct? Does this mean this node was also the master?

1- The time indicated in the log is that UTC?
2- I'm trying to see if it was high IO, but the node that stopped had no high IO.
3- the other nodes had an average peek of about 0.3%-0.5% And that was around 16:30 EST So not sure if those times match with the log.
4- On super high load days we get up to 7%-9% IO Wait but we never lost a node on those occasions.
5- The nodes are persistent nodes so I'm guessing on write there's high IO can we reduce the commit time? but even then I'm not sure it's related. As even on higher load we don't lose the node.

[04:36:07,771][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=132, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=e5645874-77e9-4455-8db5-07fa63984276, addrs=[127.0.0.1, xxx.xxx.xxx.5], sockAddrs=[/xxx.xxx.xxx.5:0, /127.0.0.1:0], discPort=0, order=108, intOrder=60, lastExchangeTime=1600798045116, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765]
[04:36:07,771][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=133, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=133, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=13f473cb-5441-460b-9b9f-a23c1bcf3b0b, addrs=[127.0.0.1, xxx.xxx.xxx.3], sockAddrs=[/127.0.0.1:0, /xxx.xxx.xxx.3:0], discPort=0, order=110, intOrder=61, lastExchangeTime=1600798045127, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=134, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=134, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=ef9f3750-edfa-474e-8e68-6f0abee5095a, addrs=[127.0.0.1, xxx.xxx.xxx.10], sockAddrs=[/127.0.0.1:0, /xxx.xxx.xxx.10:0], discPort=0, order=112, intOrder=62, lastExchangeTime=1600798045137, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=135, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=135, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=e14e5bea-2784-4a6f-af2f-e98a4bf0ab52, addrs=[127.0.0.1, xxx.xxx.xxx.63], sockAddrs=[xxxxxx-0001/xxx.xxx.xxx.63:47500, /127.0.0.1:47500], discPort=47500, order=118, intOrder=65, lastExchangeTime=1600798763474, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=null, durationFromInit=1605328567765]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=136, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=136, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=8facbe9c-6d9f-4c2d-8d9c-5c7f38dbb7da, addrs=[127.0.0.1, xxx.xxx.xxx.69], sockAddrs=[/127.0.0.1:47500, xxxxxx-0003/xxx.xxx.xxx.69:47500], discPort=47500, order=120, intOrder=66, lastExchangeTime=1600799256509, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=null, durationFromInit=1605328567765]
[04:36:07,796][INFO][db-checkpoint-thread-#56%xxxxxx%][GridCacheDatabaseSharedManager] Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, checkpointLockHoldTime=14ms, reason='timeout']
[04:36:07,832][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=carrier-ids-for-phones]
[04:36:07,839][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=twofa-otp]
[04:36:07,840][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=SQL_PUBLIC_CARRIER_CODE]
[04:36:07,841][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=carrier-lookups]
[04:36:07,842][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=ignite-sys-cache]
[04:36:08,497][INFO][node-stopper][IgniteKernal%xxxxxx]

>>> +---------------------------------------------------------------------------------+
>>> Ignite ver. 2.7.0#20181130-sha1:256ae4012cb143b4855b598b740a6f3499ead4db stopped OK
>>> +---------------------------------------------------------------------------------+
>>> Ignite instance name: xxxxxx
>>> Grid uptime: 52 days, 10:25:51.029
Maxim Muzafarov Maxim Muzafarov
Reply | Threaded
Open this post in threaded view
|

Re: What is considered high IO wait and partition exchange fiallure?

Hello John,

Can you share the full log and cluster configuration?
For instance, the cluster may fail the node due to unresponsive socket
request if this node had long GC pauses due to high load.

> Does this mean this node was also the master?

Not sure I've got your question. Do you mean the coordinator node?

On Tue, 17 Nov 2020 at 20:52, John Smith <[hidden email]> wrote:

>
> So if I understand correctly the logs below...
>
> The node that shut off was timing out trying to get partition exchange from the indicated nodes and it shut itself off correct? Does this mean this node was also the master?
>
> 1- The time indicated in the log is that UTC?
> 2- I'm trying to see if it was high IO, but the node that stopped had no high IO.
> 3- the other nodes had an average peek of about 0.3%-0.5% And that was around 16:30 EST So not sure if those times match with the log.
> 4- On super high load days we get up to 7%-9% IO Wait but we never lost a node on those occasions.
> 5- The nodes are persistent nodes so I'm guessing on write there's high IO can we reduce the commit time? but even then I'm not sure it's related. As even on higher load we don't lose the node.
>
> [04:36:07,771][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=132, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=e5645874-77e9-4455-8db5-07fa63984276, addrs=[127.0.0.1, xxx.xxx.xxx.5], sockAddrs=[/xxx.xxx.xxx.5:0, /127.0.0.1:0], discPort=0, order=108, intOrder=60, lastExchangeTime=1600798045116, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765]
> [04:36:07,771][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=133, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=133, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=13f473cb-5441-460b-9b9f-a23c1bcf3b0b, addrs=[127.0.0.1, xxx.xxx.xxx.3], sockAddrs=[/127.0.0.1:0, /xxx.xxx.xxx.3:0], discPort=0, order=110, intOrder=61, lastExchangeTime=1600798045127, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=134, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=134, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=ef9f3750-edfa-474e-8e68-6f0abee5095a, addrs=[127.0.0.1, xxx.xxx.xxx.10], sockAddrs=[/127.0.0.1:0, /xxx.xxx.xxx.10:0], discPort=0, order=112, intOrder=62, lastExchangeTime=1600798045137, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=135, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=135, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=e14e5bea-2784-4a6f-af2f-e98a4bf0ab52, addrs=[127.0.0.1, xxx.xxx.xxx.63], sockAddrs=[xxxxxx-0001/xxx.xxx.xxx.63:47500, /127.0.0.1:47500], discPort=47500, order=118, intOrder=65, lastExchangeTime=1600798763474, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=null, durationFromInit=1605328567765]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=136, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=136, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=8facbe9c-6d9f-4c2d-8d9c-5c7f38dbb7da, addrs=[127.0.0.1, xxx.xxx.xxx.69], sockAddrs=[/127.0.0.1:47500, xxxxxx-0003/xxx.xxx.xxx.69:47500], discPort=47500, order=120, intOrder=66, lastExchangeTime=1600799256509, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=null, durationFromInit=1605328567765]
> [04:36:07,796][INFO][db-checkpoint-thread-#56%xxxxxx%][GridCacheDatabaseSharedManager] Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, checkpointLockHoldTime=14ms, reason='timeout']
> [04:36:07,832][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=carrier-ids-for-phones]
> [04:36:07,839][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=twofa-otp]
> [04:36:07,840][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=SQL_PUBLIC_CARRIER_CODE]
> [04:36:07,841][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=carrier-lookups]
> [04:36:07,842][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=ignite-sys-cache]
> [04:36:08,497][INFO][node-stopper][IgniteKernal%xxxxxx]
>
> >>> +---------------------------------------------------------------------------------+
> >>> Ignite ver. 2.7.0#20181130-sha1:256ae4012cb143b4855b598b740a6f3499ead4db stopped OK
> >>> +---------------------------------------------------------------------------------+
> >>> Ignite instance name: xxxxxx
> >>> Grid uptime: 52 days, 10:25:51.029