IgniteCache.size() is hanging

arw180
I'm running a 5-node Ignite cluster, version 2.8.1, with persistence enabled and a small number of partitioned caches, ranging from a few thousand records up to one cache with over 1 billion records. No SQL use.

When I run a Java client app and connect to the cluster (with clientMode = true), the connection works fine and I can quickly retrieve the names of all caches on the cluster. However, attempting to get the size of a cache via ignite.getOrCreateCache("existingCacheName").size() just hangs. This happens regardless of which cache I try to get the size of.
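
For reference, here is roughly what the client side looks like; the discovery address and cache name below are placeholders, not my actual configuration:

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class CacheSizeCheck {
    public static void main(String[] args) {
        // Placeholder discovery address -- the real cluster addresses differ.
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Collections.singletonList("server-host:47500..47509"));

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setClientMode(true)
            .setDiscoverySpi(new TcpDiscoverySpi().setIpFinder(ipFinder));

        try (Ignite ignite = Ignition.start(cfg)) {
            // This returns quickly.
            System.out.println(ignite.cacheNames());

            // This is the call that hangs, for any cache name.
            int size = ignite.getOrCreateCache("existingCacheName").size();
            System.out.println("size = " + size);
        }
    }
}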

Sometimes I see a suspicious warning after a minute or so: WARNING: Node FAILED: TcpDiscoveryNode[...] - it appears to be referencing my client node. I don't know why the node failed, what to do about it, or why it seems to happen so frequently. There are no relevant logs from any of the Ignite server nodes, nor from the Java app/client.

There are also many times when I do not get a Node FAILED warning, but the size() operation still just hangs with no other information.

Thanks for your help!

Alan
aealexsandrov

Re: IgniteCache.size() is hanging

Hi,

Can you please provide the full server logs?

BR,
Andrei

arw180

Re: IgniteCache.size() is hanging

The only log I see is from one of the server nodes, which is spewing this at a very high rate:

[grid-nio-worker-tcp-comm-...][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/<ip>:47100, rmtAddr=<ip>:<port>]

Note that each time the log is printed, I see a different value for <port>.

Also note that I only see these logs when I try to run ignitevisorcmd's "cache" command. When I run the Java application that calls IgniteCache.size(), I don't see any such logs. But in both cases, the result is that the operation just hangs.

The cluster is active and I am able to insert data (albeit at a pretty slow rate), so it's not like things are completely non-functional. It's really confusing :\

aealexsandrov

Re: IgniteCache.size() is hanging

Hi,

Most likely some of the nodes went offline and are trying to reconnect. You probably have some network issues. This and other information should be visible in the logs. Can you provide them?

BR,
Andrei

arw180

Re: IgniteCache.size() is hanging

I wish I could -- this cluster is running on an isolated network and I can't get the logs, configs, or anything else out to the Internet.

But, I just figured out the problem -- I had set a very large value for failureDetectionTimeout (default is 10s). When I reverted that to the default, everything started working great.
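
In case it helps anyone else, a rough sketch of where that setting lives on IgniteConfiguration (the commented-out large value is only an illustration, not my exact number):

import org.apache.ignite.configuration.IgniteConfiguration;

public class FailureTimeoutExample {
    public static IgniteConfiguration serverConfig() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // What I had: a very large failure detection timeout, something like
        // cfg.setFailureDetectionTimeout(600_000); // illustrative value only

        // What fixed it: back to the 10 second default -- or simply don't set it at all.
        cfg.setFailureDetectionTimeout(10_000);

        return cfg;
    }
}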

This is interesting, because in 2.7.3, bumping up this setting didn't cause the same problem. I went back and forth between 2.7.3 and 2.8.1 a few times (using the same config with the large failureDetectionTimeout) and was able to replicate this: it worked fine in 2.7.3 and broke in 2.8.1.

Hopefully this helps someone else out there,

Alan

arw180

Re: IgniteCache.size() is hanging

Sorry, meant 2.7.6, not 2.7.3

ilya.kasnacheev

Re: IgniteCache.size() is hanging

Hello!

Can you please share a reproducer project which highlights the issue?

Thanks,
--
Ilya Kasnacheev

