Reconfiguring or Recovering an etcd Cluster in Terminal Server
This document describes how to reconfigure an etcd node or restore a whole etcd cluster that is used by Terminal Server.
Follow this guide only if you were referred to it from another chapter of the documentation.
Checking a Terminal Server etcd Cluster's Health
Connect to a server where YSoft SafeQ Management Service is installed
Start Command line (CMD) and navigate to the "<install_dir>\SPOC\terminalserver\etcd\" folder
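For example, assuming the C:\SafeQ6 installation directory used in the examples below:
cd /d C:\SafeQ6\SPOC\terminalserver\etcd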
Check the Terminal Server etcd cluster's health
Run this command:
etcdctl.exe --endpoint http://10.0.5.217:2377 cluster-health
Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration will not be changed.
The output will contain a list of Terminal Server etcd cluster members, and the last line will report the Terminal Server etcd cluster's health – it can be:
cluster is healthy
cluster is unhealthy
When the Terminal Server etcd Cluster Is Healthy
If the etcd quorum is not lost, then you can remove the affected node from the etcd cluster configuration and add a new or reconfigured node.
If you do not mind that all Embedded Terminals managed by the affected Spooler Controller Group will need to be reinstalled, you can instead use the procedure in When the Terminal Server etcd Cluster Is Unhealthy, which is much simpler.
Example environment:
Management Service is installed on IP address 10.0.13.148
First Site Server is installed on IP address 10.0.5.217 – etcd member ID 5df1a03e6509526c
Second Site Server is installed on IP address 10.0.5.218 – etcd member ID 4698d36b2a32ca93
Third Site Server is installed on IP address 10.0.5.219 – etcd member ID 54237a9912a7236 (this node will be reinstalled and recovered)
Example Result of a Terminal Server etcd Cluster Health Check
In this example, the Third Site Server was reinstalled.
failed to check the health of member 54237a9912a7236 on http://10.0.5.219:2377: Get http://10.0.5.219:2377/health: dial tcp 10.0.5.219:2377: connectex: No connection could be made because the target machine actively refused it.
member 54237a9912a7236 is unreachable: [http://10.0.5.219:2377] are all unreachable
member 4698d36b2a32ca93 is healthy: got healthy result from http://10.0.5.218:2377
member 5df1a03e6509526c is healthy: got healthy result from http://10.0.5.217:2377
cluster is healthy
Stop the affected node and delete its data
Stop the YSoft SafeQ Terminal Server service on the affected node
Delete the folder TS-XX.XX.XX.XX in "<install_dir>\SPOC\terminalserver\etcd\" on the affected node
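On the affected node from the example environment (10.0.5.219), and assuming the C:\SafeQ6 installation path used in the examples below, these two steps could look like this sketch:
rem Stop the Terminal Server service on the affected node
net stop "YSoft SafeQ Terminal Server"
rem Delete the node's etcd data folder (example path; adjust to your installation)
rmdir /s /q "C:\SafeQ6\SPOC\terminalserver\etcd\TS-10.0.5.219"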
Remove the Affected Node from the etcd Cluster
The affected node is the one reported with "failed to check the health of member" in the health check output.
Remove the affected node.
Run this command:
etcdctl.exe --endpoint http://10.0.5.217:2377 member remove 54237a9912a7236
Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration was not changed.
Replace 54237a9912a7236 with the actual etcd member ID of the reinstalled server.
The result should look like this:
Removed member 54237a9912a7236 from cluster
Verify the cluster health again.
Run this command:
etcdctl.exe --endpoint http://10.0.5.217:2377 cluster-health
Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration was not changed.
The result should look like this:
member 4698d36b2a32ca93 is healthy: got healthy result from http://10.0.5.218:2377
member 5df1a03e6509526c is healthy: got healthy result from http://10.0.5.217:2377
cluster is healthy
Add the Affected Node to the etcd Cluster Again
Add the affected node again.
Run this command:
etcdctl.exe --endpoint http://10.0.5.217:2377 member add TS-10.0.5.219 http://10.0.5.219:2378
Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration was not changed.
Replace 10.0.5.219 with the actual IP address of the affected server.
The result should look like this:
Added member named TS-10.0.5.219 with ID 188abf215116e622 to cluster

ETCD_NAME="TS-10.0.5.219"
ETCD_INITIAL_CLUSTER="TS-10.0.5.219=http://10.0.5.219:2378,TS-10.0.5.218=http://10.0.5.218:2378,TS-10.0.5.217=http://10.0.5.217:2378"
ETCD_INITIAL_CLUSTER_STATE="existing"
Verify the cluster health again.
Run this command:
etcdctl.exe --endpoint http://10.0.5.217:2377 cluster-health
Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration was not changed.
The result should look like this:
member 188abf215116e622 is unreachable: no available published client urls
member 4698d36b2a32ca93 is healthy: got healthy result from http://10.0.5.218:2377
member 5df1a03e6509526c is healthy: got healthy result from http://10.0.5.217:2377
cluster is healthy
Connect to the Affected Node
Start Command line (CMD) and navigate to the "<install_dir>\SPOC\terminalserver\etcd\" folder.
Do not use PowerShell!
Run etcd manually to create the proper etcd configuration:
This is needed only once after the changes.
etcd64.exe -name TS-10.0.5.219 -data-dir "c:\SafeQ6\SPOC\terminalserver\etcd\TS-10.0.5.219" -initial-advertise-peer-urls http://10.0.5.219:2378 -listen-peer-urls http://10.0.5.219:2378 -listen-client-urls http://10.0.5.219:2377,http://127.0.0.1:2377 -advertise-client-urls http://10.0.5.219:2377 -initial-cluster-token safeq-cluster -initial-cluster TS-10.0.5.219=http://10.0.5.219:2378,TS-10.0.5.218=http://10.0.5.218:2378,TS-10.0.5.217=http://10.0.5.217:2378 -initial-cluster-state existing
Replace 10.0.5.219 with the actual IP address of the affected server.
Replace the -initial-cluster value with the ETCD_INITIAL_CLUSTER value that was shown just after adding the affected node back to the cluster in the previous section.
The command will not exit; it will keep running and printing various messages. Wait until a message reports that the affected node was published, then continue with the next step:
<datetime> I | etcdserver: published {Name:TS-10.0.5.219 ClientURLs:[http://10.0.5.219:2377]} to cluster fb81dcd206a7a785
Start the YSoft SafeQ Terminal Server service.
At this point, the previously launched command in CMD will exit.
Verify that "Offline storage refreshed" can be seen in the Terminal Server log at least at ten minutes after the start of Terminal Server.
Connect to a Server Where YSoft SafeQ Management Service Is Installed and Verify the etcd Cluster's Health Again
Start Command line (CMD) and navigate to the "<install_dir>\SPOC\terminalserver\etcd\" folder.
Run this command:
etcdctl.exe --endpoint http://10.0.5.217:2377 cluster-health
Replace 10.0.5.217 with the actual IP address of a server where Spooler Controller is still functional or where the configuration was not changed.
The result should look like this:
member 188abf215116e622 is healthy: got healthy result from http://10.0.5.219:2377
member 4698d36b2a32ca93 is healthy: got healthy result from http://10.0.5.218:2377
member 5df1a03e6509526c is healthy: got healthy result from http://10.0.5.217:2377
cluster is healthy
The node is now reconfigured.
When the Terminal Server etcd Cluster Is Unhealthy
Unfortunately, you cannot add or remove nodes if the Terminal Server etcd quorum was lost. You can only recreate the Terminal Server etcd cluster.
All data stored inside the Terminal Server etcd cluster will be lost, so you will need to reinstall all affected YSoft SafeQ Embedded Terminals after cluster recreation.
Stop the YSoft SafeQ Terminal Server service on all nodes in the affected Spooler Controller Group (a scripted sketch of the per-node steps follows this list).
Back up the folder TS-XX.XX.XX.XX in "<install_dir>\SPOC\terminalserver\etcd\" on all nodes.
Delete the folder TS-XX.XX.XX.XX in "<install_dir>\SPOC\terminalserver\etcd\" on all nodes.
Start the YSoft SafeQ Terminal Server service on all nodes.
Verify that "Offline storage refreshed" can be seen in the Terminal Server log after the start of Terminal Server (it might take up to 10 minutes before this record appears)
Reinstall all YSoft SafeQ Embedded Terminals that are managed by the affected Spooler Controller Group.
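For reference, the per-node steps above (stop the service, back up and delete the etcd data folder, start the service) can be scripted. A minimal CMD sketch, assuming the C:\SafeQ6 installation path and the 10.0.5.219 node from the example environment; run the equivalent on every node in the affected Spooler Controller Group:
rem Run on each node; the paths, backup location, and node IP below are example assumptions
net stop "YSoft SafeQ Terminal Server"
rem Back up the node's etcd data folder before deleting it
xcopy /e /i "C:\SafeQ6\SPOC\terminalserver\etcd\TS-10.0.5.219" "C:\backup\TS-10.0.5.219"
rmdir /s /q "C:\SafeQ6\SPOC\terminalserver\etcd\TS-10.0.5.219"
net start "YSoft SafeQ Terminal Server"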