Lost node again.

javadevmtl
Lost node again.

Hi guys, it seems every couple of weeks we lose a node... Here are the logs: https://www.dropbox.com/sh/8cv2v8q5lcsju53/AAAU6ZSFkfiZPaMwHgIh5GAfa?dl=0

And some extra details. Maybe I need to do more tuning than what is already mentioned below; maybe set a higher timeout?

3 server nodes and 9 clients (client = true)

Performance-wise, the cluster is not doing any kind of high volume; on average it does about 15-20 puts/gets/queries (any combination of) per 30-60 seconds.

The biggest cache we have holds 3 million records, distributed with 1 backup, using the following template.

          <bean id="cache-template-bean" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
            <!-- when you create a template via XML configuration,
            you must add an asterisk to the name of the template -->
            <property name="name" value="partitionedTpl*"/>
            <property name="cacheMode" value="PARTITIONED" />
            <property name="backups" value="1" />
            <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
          </bean>

Persistence is configured:

      <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
          <!-- Redefining the default region's settings -->
          <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
              <property name="persistenceEnabled" value="true"/>

              <property name="name" value="Default_Region"/>
              <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
            </bean>
          </property>
        </bean>
      </property> 

We also followed the tuning instructions for GC and I/O:
if [ -z "$JVM_OPTS" ] ; then
    JVM_OPTS="-Xms6g -Xmx6g -server -XX:MaxMetaspaceSize=256m"
fi

#
# Uncomment the following GC settings if you see spikes in your throughput due to Garbage Collection.
#
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
sysctl -w vm.dirty_writeback_centisecs=500
sysctl -w vm.dirty_expire_centisecs=500

ilya.kasnacheev
Re: Lost node again.

Hello!

[13:39:53,242][WARNING][jvm-pause-detector-worker][IgniteKernal%company] Possible too long JVM pause: 41779 milliseconds.

It seems that you have too-long full GC. Either make sure it does not happen, or increase failureDetectionTimeout to be longer than any expected GC.
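As a minimal sketch, the timeout is a property of IgniteConfiguration and is given in milliseconds; the 60-second value below is only an illustration, chosen to exceed the 41-second pause above:

      <bean class="org.apache.ignite.configuration.IgniteConfiguration">
        <!-- Failure detection timeout for server nodes, in milliseconds (default is 10000). -->
        <!-- 60 seconds is only an example; pick a value above your worst expected pause. -->
        <property name="failureDetectionTimeout" value="#{60L * 1000}"/>
      </bean>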

Regards,
--
Ilya Kasnacheev



javadevmtl
Re: Lost node again.

I don't see why we would get such a huge pause; in fact, I have provided GC logs before and we found nothing...

All operations on the "big" partitioned 3-million-record cache are puts or gets, plus a query on another cache which has 450 entries. There are no other caches.

The nodes all have 6G of heap and 26G off heap.

I think it could be I/O related, but I can't seem to correlate it to I/O. I saw some heavy I/O usage, but the node failed well after that.

Now my question is: should I put the failure detection timeout to 60s just for the sake of trying it? Isn't that too high? If I put the servers at 60s, how high should I put the clients?


ilya.kasnacheev
Re: Lost node again.

Hello!

Most of those questions are rhetorical, but I would say that a 60s failure detection timeout is not unheard of. For clients you can use a smaller value (clientFailureDetectionTimeout), since losing a client is not as impactful.
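As a rough sketch (the values are illustrative only, not a recommendation for this cluster), both timeouts live on IgniteConfiguration and are given in milliseconds:

      <bean class="org.apache.ignite.configuration.IgniteConfiguration">
        <!-- How long a server node may be unresponsive before it is dropped; example value only. -->
        <property name="failureDetectionTimeout" value="#{60L * 1000}"/>
        <!-- Same for client nodes; can be kept smaller since losing a client is less impactful. -->
        <property name="clientFailureDetectionTimeout" value="#{30L * 1000}"/>
      </bean>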

Regards,
--
Ilya Kasnacheev



dmagda
Re: Lost node again.

John,

I would try to get to the bottom of the issue, especially if the case is reproducible.

If that's not GC, then check whether it's the I/O (your logs show that the checkpointing rate is high).
As for the failureDetectionTimeout, I would set it to 15 seconds until your cluster is battle-tested and well-tuned for your use case.
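A minimal sketch of both suggestions in the Spring XML (values are illustrative; writeThrottlingEnabled is a standard DataStorageConfiguration property that slows writes down when checkpointing cannot keep up, instead of letting the node stall):

      <!-- 15-second failure detection timeout, as suggested above; milliseconds. -->
      <property name="failureDetectionTimeout" value="#{15L * 1000}"/>

      <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
          <!-- Throttle writes when the checkpoint cannot keep up, rather than freezing the node. -->
          <property name="writeThrottlingEnabled" value="true"/>
        </bean>
      </property>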
 
-
Denis



javadevmtl
Re: Lost node again.

Hi, here is an example of our cluster during our normal "high" usage. The node shutdowns seem to happen during "off" hours.

Denis, wouldn't a 15-second failureDetectionTimeout cause even more shutdowns?
We also considered more of the tuning options in the docs; we'll see, I guess...
For now we don't have separate disks.



Attachment: Screen Shot 2020-08-20 at 9.33.37 AM.png (756K)
dmagda
Re: Lost node again.

Denis, wouldn't a 15-second failureDetectionTimeout cause even more shutdowns?
 
What's your current value? For sure, it doesn't make sense to decrease the value until all mysterious pauses are figured out. The downside of a high failureDetectionTimeout is that the cluster won't remove a node that failed for a real reason until the timeout expires. So, if there is a failed node that has to process some operations, the rest of the cluster will keep trying to reach it until the failureDetectionTimeout is reached. That affects the performance of any operations in which the failed node has to be involved.

Btw, what's the tool you are using for the monitoring? Looks nice.

-
Denis



javadevmtl
Re: Lost node again.

It's the default. And as per Ilya, I had a suspected GC pause of 45000 ms, so I figured 60 seconds would be OK. As for the GC pauses, we (as in I and the Ignite team) have already looked at the GC logs previously and it wasn't the issue.

For the monitoring we are using Elasticsearch, with Metricbeat and Kibana as the dashboard. We're not on the latest version, otherwise I would be able to use JMX as well :p
I will try to look into a JMX/Kafka exporter or something and see if I can get those metrics into Elastic, when and if I have time, lol.


