Reconfiguring or Recovering an etcd Cluster in Terminal Server

This document describes how to reconfigure an etcd node or restore a whole etcd cluster that is used by Terminal Server.

Follow this guide only if you were referred to it from another chapter of the documentation.

Checking a Terminal Server etcd Cluster's Health

  1. Connect to a server where YSoft SafeQ Management Service is installed

    1. Start Command line (CMD) and navigate to the "<install_dir>\SPOC\terminalserver\etcd\" folder

    2. Check the Terminal Server etcd cluster's health

      1. Run this command:

        etcdctl.exe --endpoint http://10.0.5.217:2377 cluster-health

        Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration will not be changed. A variant of this check that queries each endpoint in turn is sketched after this list.

      2. The output will contain a list of Terminal Server etcd cluster members, and the last line will report the Terminal Server etcd cluster's health – it can be:

        1. cluster is healthy

        2. cluster is unhealthy
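
The same check can also be repeated against each Spooler Controller endpoint to see which of them answer at all. The following one-liner is only a convenience sketch using the example IP addresses from this guide; it wraps the etcdctl.exe command shown above and nothing else (inside a batch file, write %%E instead of %E).

  for %E in (10.0.5.217 10.0.5.218 10.0.5.219) do etcdctl.exe --endpoint http://%E:2377 cluster-health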

When the Terminal Server etcd Cluster Is Healthy

If the etcd quorum is not lost, then you can remove the affected node from the etcd cluster configuration and add a new or reconfigured node.

If you do not mind that all Embedded Terminals managed by the affected Spooler Controller Group will need to be reinstalled, you can use the procedure in When the Terminal Server etcd Cluster Is Unhealthy below, which is much simpler.

Example environment:

  • Management Service is installed on IP address 10.0.13.148

  • First Site Server is installed on IP address 10.0.5.217 – etcd member ID 5df1a03e6509526c

  • Second Site Server is installed on IP address 10.0.5.218 – etcd member ID 4698d36b2a32ca93

  • Third Site Server is installed on IP address 10.0.5.219 – etcd member ID 54237a9912a7236 (this node will be reinstalled and recovered)

Example Result of a Terminal Server etcd Cluster Health Check

In this example, the Third Site Server was reinstalled.


failed to check the health of member 54237a9912a7236 on http://10.0.5.219:2377: Get http://10.0.5.219:2377/health: dial
tcp 10.0.5.219:2377: connectex: No connection could be made because the target machine actively refused it.
member 54237a9912a7236 is unreachable: [http://10.0.5.219:2377] are all unreachable
member 4698d36b2a32ca93 is healthy: got healthy result from http://10.0.5.218:2377
member 5df1a03e6509526c is healthy: got healthy result from http://10.0.5.217:2377
cluster is healthy
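
If the output is long, the lines that identify the affected member can be filtered out with findstr. This is only a convenience; the member ID still has to be read from the matching lines manually.

  etcdctl.exe --endpoint http://10.0.5.217:2377 cluster-health | findstr /C:"failed to check" /C:"is unreachable"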

Stop the affected node and delete its data

  1. Stop the YSoft SafeQ Terminal Server service on the affected node

  2. Delete the folder TS-XX.XX.XX.XX in "<install_dir>\SPOC\terminalserver\etcd\" on the affected node
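
Both steps above can be performed from an elevated CMD session on the affected node, for example as follows. The service name below is the display name used in this guide, and the folder name matches the example environment (Third Site Server, 10.0.5.219); adjust both, and <install_dir>, to your installation.

  rem Run in an elevated CMD on the affected node; adjust the service name, install directory, and folder name.
  net stop "YSoft SafeQ Terminal Server"
  rmdir /S /Q "<install_dir>\SPOC\terminalserver\etcd\TS-10.0.5.219"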

Remove the Affected Node from the etcd Cluster

The affected node is the one reported with "failed to check the health of member" in the health check output.

  1. Remove the affected node.

    1. Run this command:

      etcdctl.exe --endpoint http://10.0.5.217:2377 member remove 54237a9912a7236

      Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration was not changed.

      Replace 54237a9912a7236 with the actual etcd member ID of the reinstalled server.

    2. The result should look like this:

      Removed member 54237a9912a7236 from cluster
  2. Verify the cluster health again.

    1. Run this command:

      etcdctl.exe --endpoint http://10.0.5.217:2377 cluster-health

      Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration was not changed.

    2. The result should look like this:

      member 4698d36b2a32ca93 is healthy: got healthy result from http://10.0.5.218:2377
      member 5df1a03e6509526c is healthy: got healthy result from http://10.0.5.217:2377
      cluster is healthy
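
If you prefer to run the removal and the follow-up health check in one go, a minimal sketch (saved as a .cmd file) is shown below. It assumes that etcdctl.exe returns a non-zero exit code when the removal fails; the endpoint IP address and the member ID are the example values from this guide and must be adjusted.

  @echo off
  rem Remove the affected member and re-check the cluster health; adjust the endpoint IP and member ID.
  etcdctl.exe --endpoint http://10.0.5.217:2377 member remove 54237a9912a7236
  if errorlevel 1 (
      echo Member removal failed - check the cluster state before continuing.
      exit /b 1
  )
  etcdctl.exe --endpoint http://10.0.5.217:2377 cluster-health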

Add the Affected Node to the etcd Cluster Again

  1. Add the affected node again.

    1. Run this command:

      etcdctl.exe --endpoint http://10.0.5.217:2377 member add TS-10.0.5.219 http://10.0.5.219:2378

      Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration was not changed.

      Replace 10.0.5.219 with the actual IP address of the affected server.

    2. The result should look like this:

      Added member named TS-10.0.5.219 with ID 188abf215116e622 to cluster
       
      ETCD_NAME="TS-10.0.5.219"
      ETCD_INITIAL_CLUSTER="TS-10.0.5.219=http://10.0.5.219:2378,TS-10.0.5.218=http://10.0.5.218:2378,TS-10.0.5.217=http://10.0.5.217:2378"
      ETCD_INITIAL_CLUSTER_STATE="existing"
  2. Verify the cluster health again.

    1. Run this command:

      etcdctl.exe --endpoint http://10.0.5.217:2377 cluster-health

      Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration was not changed.

    2. The result should look like this (the newly added member is reported as unreachable until etcd is started on it in the next section):

      member 188abf215116e622 is unreachable: no available published client urls
      member 4698d36b2a32ca93 is healthy: got healthy result from http://10.0.5.218:2377
      member 5df1a03e6509526c is healthy: got healthy result from http://10.0.5.217:2377
      cluster is healthy
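
You can also list the cluster members to confirm that the new member was registered. member list is a standard etcdctl command; note that the newly added member has no published client URLs until etcd is started on the affected node in the next section, and the exact output format may differ between etcd versions.

  etcdctl.exe --endpoint http://10.0.5.217:2377 member list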

Connect to the Affected Node

  1. Start Command line (CMD) and navigate to the "<install_dir>\SPOC\terminalserver\etcd\" folder.

    Do not use PowerShell!

  2. Run etcd manually to create the proper etcd configuration:

    This is needed only once after the changes.

    etcd64.exe -name TS-10.0.5.219 -data-dir "c:\SafeQ6\SPOC\terminalserver\etcd\TS-10.0.5.219" -initial-advertise-peer-urls http://10.0.5.219:2378 -listen-peer-urls http://10.0.5.219:2378 -listen-client-urls http://10.0.5.219:2377,http://127.0.0.1:2377 -advertise-client-urls http://10.0.5.219:2377 -initial-cluster-token safeq-cluster -initial-cluster TS-10.0.5.219=http://10.0.5.219:2378,TS-10.0.5.218=http://10.0.5.218:2378,TS-10.0.5.217=http://10.0.5.217:2378 -initial-cluster-state existing

    Replace 10.0.5.219 with the actual IP address of the affected server.

    Replace the -initial-cluster value with the ETCD_INITIAL_CLUSTER value that was shown in the output of the member add command above (Add the Affected Node to the etcd Cluster Again).

    The command will not exit; it will keep running and display various messages. Wait until it reports that the affected node was published, then continue with the next step:

    <datetime> I | etcdserver: published {Name:TS-10.0.5.219 ClientURLs:[http://10.0.5.219:2377]} to cluster fb81dcd206a7a785
  3. Start the YSoft SafeQ Terminal Server service.

    • At this point, the previously launched command in CMD will exit.

  4. Verify that "Offline storage refreshed" appears in the Terminal Server log after the start of the Terminal Server service (it might take up to ten minutes before this record appears).
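
To check for this record from CMD, the log can be searched with findstr; the local etcd health endpoint can be queried as well if curl is available (it ships with recent Windows versions, and the local client URL matches the -listen-client-urls value used above). The log file path below is a placeholder and must be replaced with the actual Terminal Server log file.

  rem <path_to_terminal_server_log> is a placeholder - point it at the actual Terminal Server log file.
  findstr /C:"Offline storage refreshed" "<path_to_terminal_server_log>"
  rem Optional: query the local etcd health endpoint (requires curl).
  curl http://127.0.0.1:2377/health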

Connect to a Server Where YSoft SafeQ Management Service Is Installed and Verify the etcd Cluster's Health Again

  1. Start Command line (CMD) and navigate to the "<install_dir>\SPOC\terminalserver\etcd\" folder.

    1. Run this command:

      etcdctl.exe --endpoint http://10.0.5.217:2377 cluster-health

      Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration was not changed.

    2. The result should look like this:

      member 188abf215116e622 is healthy: got healthy result from http://10.0.5.219:2377
      member 4698d36b2a32ca93 is healthy: got healthy result from http://10.0.5.218:2377
      member 5df1a03e6509526c is healthy: got healthy result from http://10.0.5.217:2377
      cluster is healthy
    3. The node is now reconfigured.

When the Terminal Server etcd Cluster Is Unhealthy

Unfortunately, you cannot add or remove nodes if the Terminal Server etcd quorum was lost. You can only recreate the Terminal Server etcd cluster again.

All data stored inside the Terminal Server etcd cluster will be lost, so you will need to reinstall all affected YSoft SafeQ Embedded Terminals after cluster recreation.

  1. Stop the YSoft SafeQ Terminal Server service on all nodes in the affected Spooler Controller Group.

    1. Back up the folder TS-XX.XX.XX.XX in "<install_dir>\SPOC\terminalserver\etcd\" on all nodes.

    2. Delete the folder TS-XX.XX.XX.XX in "<install_dir>\SPOC\terminalserver\etcd\" on all nodes.

  2. Start the YSoft SafeQ Terminal Server service on all nodes.

    1. Verify that "Offline storage refreshed" can be seen in the Terminal Server log after the start of Terminal Server (it might take up to 10 minutes before this record appears).

  3. Reinstall all YSoft SafeQ Embedded Terminals that are managed by the affected Spooler Controller Group.
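
On each node, the backup and deletion of the etcd data folder (steps 1.1 and 1.2 above) can be done from an elevated CMD session, for example as sketched below. The backup location is an assumption; replace XX.XX.XX.XX with the node's IP address and <install_dir> with the actual installation directory.

  rem Run on each node after stopping the YSoft SafeQ Terminal Server service.
  cd /d "<install_dir>\SPOC\terminalserver\etcd"
  rem Back up the etcd data folder, then delete it; <backup_location> is a placeholder.
  robocopy "TS-XX.XX.XX.XX" "<backup_location>\TS-XX.XX.XX.XX" /E
  rmdir /S /Q "TS-XX.XX.XX.XX"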