removing ControlCenterAgent

Bastien Durel

Hello,

I'm running a 2.9.0 cluster with 2 nodes. I tried to use GridGain's
ControlCenterAgent to investigate a slowdown.

After I removed the agent files from the servers (I don't want to have to
put them on all clients), the second node can no longer join the cluster
when I start it.

If I start node A, then node B, node B fails, but if I start node B,
then node A, node A fails.

If I put the agent files back, then all nodes can start, but clients
fail because they don't have the agent classes themselves.

When a node fails to start, it prints this log:


[17:52:45,265][INFO][tcp-disco-sock-reader-[2f3f6f3a 192.168.43.29:39675]-#6%ClusterWA%-#50%ClusterWA%][TcpDiscoverySpi] Initialized connection with remote server node [nodeId=2f3f6f3a-accb-4708-a5cc-26d324a07816, rmtAddr=/192.168.43.29:39675]
[17:52:45,268][SEVERE][main][IgniteKernal%ClusterWA] Failed to start manager: GridManagerAdapter [enabled=true, name=o.a.i.i.managers.discovery.GridDiscoveryManager]
class org.apache.ignite.IgniteCheckedException: Failed to start SPI: TcpDiscoverySpi [addrRslvr=null, sockTimeout=5000, ackTimeout=5000, marsh=JdkMarshaller [clsFilter=org.apache.ignite.marshaller.MarshallerUtils$1@39a8e2fa], reconCnt=10, reconDelay=2000, maxAckTimeout=600000, soLinger=5, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null, skipAddrsRandomization=false]
        at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:302)
        at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:967)
        at org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1935)
        at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1298)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2046)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1698)
        at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1114)
        at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1032)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:918)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:817)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:687)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:656)
        at org.apache.ignite.Ignition.start(Ignition.java:353)
        at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:300)
Caused by: class org.apache.ignite.spi.IgniteSpiException: Unable to unmarshal key=metastorage.cluster.id.tag
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:2018)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1189)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:462)
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2120)
        at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:299)
        ... 13 more
[17:52:45,271][SEVERE][main][IgniteKernal%ClusterWA] Got exception while starting (will rollback startup routine).
class org.apache.ignite.IgniteCheckedException: Failed to start manager: GridManagerAdapter [enabled=true, name=org.apache.ignite.internal.managers.discovery.GridDiscoveryManager]
        at org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1940)
        at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1298)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2046)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1698)
        at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1114)
        at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1032)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:918)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:817)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:687)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:656)
        at org.apache.ignite.Ignition.start(Ignition.java:353)
        at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:300)
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to start SPI: TcpDiscoverySpi [addrRslvr=null, sockTimeout=5000, ackTimeout=5000, marsh=JdkMarshaller [clsFilter=org.apache.ignite.marshaller.MarshallerUtils$1@39a8e2fa], reconCnt=10, reconDelay=2000, maxAckTimeout=600000, soLinger=5, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null, skipAddrsRandomization=false]
        at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:302)
        at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:967)
        at org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1935)
        ... 11 more
Caused by: class org.apache.ignite.spi.IgniteSpiException: Unable to unmarshal key=metastorage.cluster.id.tag
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:2018)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1189)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:462)
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2120)
        at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:299)
        ... 13 more
[17:52:45,271][INFO][tcp-disco-sock-reader-[2f3f6f3a 192.168.43.29:39675]-#6%ClusterWA%-#50%ClusterWA%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.43.29:39675, rmtPort=39675

The node that is already running logs this:

[17:52:45,223][INFO][tcp-disco-sock-reader-[9a3233c6 192.168.43.30:54951]-#4%ClusterWA%-#55%ClusterWA%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.43.30:54951, rmtPort=54951
[17:52:45,246][INFO][tcp-disco-msg-worker-[crd]-#2%ClusterWA%-#46%ClusterWA%][GridEncryptionManager] Joining node doesn't have stored group keys [node=9a3233c6-3a6c-4be0-b5e7-19cdff30f69e]
[17:52:45,266][WARNING][disco-pool-#56%ClusterWA%][TcpDiscoverySpi] Unable to unmarshal key=metastorage.cluster.id.tag

If I start the nodes in the reverse order, the running node logs this:

[17:56:52,426][INFO][tcp-disco-sock-reader-[4b8b92f5 192.168.43.29:42557]-#4%ClusterWA%-#53%ClusterWA%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.43.29:42557, rmtPort=42557
[17:56:52,446][INFO][tcp-disco-msg-worker-[crd]-#2%ClusterWA%-#46%ClusterWA%][GridEncryptionManager] Joining node doesn't have stored group keys [node=4b8b92f5-1753-4b1b-9902-476c925fa49d]
[17:56:52,466][WARNING][disco-pool-#54%ClusterWA%][TcpDiscoverySpi] Unable to unmarshal key=metastorage.cluster.id.tag

Is there a way to recover?

Thanks,

--
Bastien Durel
DATA
Enterprise data integration,
business intelligence systems.

[hidden email]
tel : +33 (0) 1 57 19 59 28
fax : +33 (0) 1 57 19 59 73
45 avenue Carnot, 94230 CACHAN France
www.data.fr


Bastien Durel

Re: removing ControlCenterAgent

I forgot to attach my configuration (I removed the cache config details).
I'm using the Debian package, so I run the cluster with XML configuration.

Regards,


--
Bastien Durel


Attachment: ignite.xml (4K)
Denis Mekhanikov

Re: removing ControlCenterAgent

Hi!

The issue is that Control Center Agent stores its configuration in the metastorage.
Ignite has a bug in processing metastorage data whose class is not present on all nodes: https://issues.apache.org/jira/browse/IGNITE-13642
In effect, you can't remove control-center-agent from a cluster that previously ran with it.
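
To illustrate the failure mode described above, here is a minimal sketch using plain JDK serialization (the log shows discovery using JdkMarshaller). It is not Ignite code: AgentConfig is a hypothetical stand-in for a class that ships only in the agent library, and the agentOnClasspath flag simulates whether that library is present on the reading node.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;

class MissingClassDemo {
    /** Hypothetical stand-in for a configuration class shipped only in the agent library. */
    static class AgentConfig implements Serializable {
        private static final long serialVersionUID = 1L;
        final String uri;

        AgentConfig(String uri) {
            this.uri = uri;
        }
    }

    /** Serializes a value with plain JDK serialization. */
    static byte[] marshal(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /**
     * Deserializes the data. When agentOnClasspath is false, AgentConfig is
     * reported as missing, as it would be on a node without the agent library.
     */
    static Object unmarshal(byte[] data, boolean agentOnClasspath)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data)) {
            @Override
            protected Class<?> resolveClass(ObjectStreamClass desc)
                    throws IOException, ClassNotFoundException {
                if (!agentOnClasspath && desc.getName().equals(AgentConfig.class.getName())) {
                    throw new ClassNotFoundException(desc.getName());
                }
                return super.resolveClass(desc);
            }
        }) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // The value is written while the (simulated) agent library is present.
        byte[] stored = marshal(new AgentConfig("http://localhost:3000"));

        System.out.println("with agent classes: " + unmarshal(stored, true).getClass().getSimpleName());
        try {
            unmarshal(stored, false);
        } catch (ClassNotFoundException e) {
            System.out.println("without agent classes: cannot unmarshal " + e.getMessage());
        }
    }
}
```

The stored bytes stay readable only as long as every reader can resolve the class, which is why putting the library back on all nodes makes the error go away.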

You have a few options to solve it:
- Add control-center-agent to the classpath of all nodes and disable it using management.sh --off. The classes and configuration will still be there, but the agent won't do anything. You'll be able to remove the library after upgrading to a version that doesn't have this bug. Hopefully, it will be fixed in Ignite 2.9.1.

- Remove the metastorage directory from the persistence directory on all nodes. This removes the Control Center Agent configuration along with the Baseline Topology history.
You will need to do that at the same time as you remove the control-center-agent library.
NOTE that removing the metastorage is a dangerous operation and can lead to data loss. I recommend the first option if it works for you.
Make a copy of the persistence directories before removing anything. After the removal and a restart, the baseline topology (BLT) will be reset. Make sure that the first activation produces the same BLT as before the restart to avoid data loss.
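
The "copy first, then remove" order of the second option could be sketched roughly as below. This is only an illustration under assumptions: it presumes the default layout where each node folder under the db directory contains a metastorage subdirectory, that the node is stopped, and that both paths are supplied by you; MetastorageCleanup and its methods are hypothetical names, not part of Ignite.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class MetastorageCleanup {
    /** Recursively copies src into dst; parents are visited before their children. */
    static void copyTree(Path src, Path dst) throws IOException {
        try (Stream<Path> paths = Files.walk(src)) {
            for (Path p : paths.collect(Collectors.toList())) {
                Path target = dst.resolve(src.relativize(p).toString());
                if (Files.isDirectory(p)) {
                    Files.createDirectories(target);
                } else {
                    Files.copy(p, target);
                }
            }
        }
    }

    /** Recursively deletes root; reverse ordering deletes children before parents. */
    static void deleteTree(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList())) {
                Files.delete(p);
            }
        }
    }

    /**
     * Backs up the whole db directory, then removes "<node folder>/metastorage"
     * (assumed layout) for every node folder. Returns the removed directories.
     */
    static List<Path> backupAndRemoveMetastorage(Path dbDir, Path backupDir) throws IOException {
        if (Files.exists(backupDir) || backupDir.startsWith(dbDir)) {
            throw new IllegalArgumentException("backup dir must be new and outside " + dbDir);
        }
        copyTree(dbDir, backupDir); // nothing is deleted unless the copy succeeded

        List<Path> removed = new ArrayList<>();
        try (DirectoryStream<Path> nodeDirs = Files.newDirectoryStream(dbDir)) {
            for (Path nodeDir : nodeDirs) {
                Path meta = nodeDir.resolve("metastorage");
                if (Files.isDirectory(meta)) {
                    deleteTree(meta);
                    removed.add(meta);
                }
            }
        }
        return removed;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: MetastorageCleanup <db dir> <new backup dir>");
            return;
        }
        for (Path p : backupAndRemoveMetastorage(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println("removed " + p);
        }
    }
}
```

Run it on every node while the cluster is stopped, and check the baseline topology after the first activation as described above.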

Also note that Control Center doesn't support Ignite 2.9 yet. The agent for it is on its way. Currently only Ignite 2.8 is supported.

Denis

On 28.10.2020, 19:58, "Bastien Durel" <[hidden email]> wrote:

    [...]


Bastien Durel

Re: removing ControlCenterAgent

On Thursday, 29 October 2020 at 12:07 +0000, Mekhanikov Denis wrote:

> [...]
Hello,

Thanks for the info. I've removed the db directory on all nodes, as most of
my data is in 3rd-party storage, and I can live without the event logs that
use Ignite storage, as we're not in production.

We'll keep this in mind to avoid future problems.

Regards,

--
Bastien Durel