etcd clustering step by step
This is an attempt to document the steps I took to understand etcd clustering. I think that by following these steps, we are better positioned to understand what etcd disaster recovery looks like in practice.
Prerequisites
I added the following entries to /etc/hosts so that I can assign DNS names to each of the etcd instances that I’m about to create.
$ cat /etc/hosts
# Static table lookup for hostnames.
# See hosts(5) for details.
127.0.0.1 etcd-1
127.0.0.2 etcd-2
127.0.0.3 etcd-3
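As a quick sanity check (assuming a Linux host where getent is available), each name should now resolve to its loopback address:
$ getent hosts etcd-1
127.0.0.1       etcd-1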
The first instance
We can run the following command to start our first etcd instance:
$ etcd --name=etcd-1 \
--listen-peer-urls http://127.0.0.1:2380 \
--listen-client-urls http://127.0.0.1:2379 \
--initial-cluster etcd-1=http://etcd-1:2380 \
--initial-advertise-peer-urls http://etcd-1:2380 \
--advertise-client-urls http://etcd-1:2379
The default value for --name is default, which doesn’t look great because I will need to create n instances, so I decided to name them etcd-{x}.
--listen-peer-urls specifies the address on which to listen for requests from peers (other etcd instances), and --listen-client-urls specifies the address on which to listen for requests from clients (for instance, etcdctl). I could have skipped these two flags, but for consistency and better clarity, I decided to set them explicitly.
The default value for --initial-cluster is 'default=http://localhost:2380'. Since I changed the name and also want to use a specific DNS name, I need to override it accordingly.
--initial-advertise-peer-urls and --advertise-client-urls specify how this instance expects peers and clients, respectively, to reach it.
Now let’s store some data into it:
$ etcdctl --endpoints=http://etcd-1:2379 put foo bar
OK
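We can read it back to confirm the write:
$ etcdctl --endpoints=http://etcd-1:2379 get foo
foo
bar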
Add the second instance
It is natural to just try the same command with a few modifications. Let’s try it out:
$ etcd --name=etcd-2 \
--listen-peer-urls http://127.0.0.2:2380 \
--listen-client-urls http://127.0.0.2:2379 \
--initial-cluster etcd-2=http://etcd-2:2380 \
--initial-advertise-peer-urls http://etcd-2:2380 \
--advertise-client-urls http://etcd-2:2379
The command runs fine, but it just starts another new etcd instance that is not connected to the first instance at all:
$ etcdctl --endpoints=http://etcd-2:2379 get foo
The command above returns nothing, whereas we would expect to see bar.
In order to connect to the first instance, we should specify its peer address in --initial-cluster as well. Let’s add etcd-1 to the list:
You may have to run rm -drf etcd-2.etcd to clean up first.
$ etcd --name=etcd-2 \
--listen-peer-urls http://127.0.0.2:2380 \
--listen-client-urls http://127.0.0.2:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380 \
--initial-advertise-peer-urls http://etcd-2:2380 \
--advertise-client-urls http://etcd-2:2379
This takes us further, but it fails repeatedly with the message below:
{
"level": "warn",
"ts": "2023-06-17T21:26:46.212008-0700",
"caller": "rafthttp/stream.go:653",
"msg": "request sent was ignored by remote peer due to cluster ID mismatch",
"remote-peer-id": "a3959071884acd0c",
"remote-peer-cluster-id": "572a811fac16a854",
"local-member-id": "97ecdd3706cc2d90",
"local-member-cluster-id": "b2698d4259f33e40",
"error": "cluster ID mismatch"
}
It means that this instance tried to create another cluster, which isn’t what we expected. We can fix this by adding --initial-cluster-state existing.
$ etcd --name=etcd-2 \
--listen-peer-urls http://127.0.0.2:2380 \
--listen-client-urls http://127.0.0.2:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380 \
--initial-advertise-peer-urls http://etcd-2:2380 \
--advertise-client-urls http://etcd-2:2379 \
--initial-cluster-state existing
This command still fails to bring up the second instance. The error message is:
{
"level": "fatal",
"ts": "2023-06-17T21:30:36.560232-0700",
"caller": "etcdmain/etcd.go:204",
"msg": "discovery failed",
"error": "error validating peerURLs {ClusterID:572a811fac16a854 Members:[&{ID:a3959071884acd0c RaftAttributes:{PeerURLs:[http://etcd-1:2380] IsLearner:false} Attributes:{Name:etcd-1 ClientURLs:[http://etcd-1:2379]}}] RemovedMemberIDs:[]}: member count is unequal",
"stacktrace": "go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"
}
This basically means that the member count doesn’t match: we specified two members in --initial-cluster, but the cluster only knows about one member, etcd-1:
$ etcdctl --endpoints=http://etcd-1:2379 member list
a3959071884acd0c, started, etcd-1, http://etcd-1:2380, http://etcd-1:2379, false
To fix this, we can add etcd-2 manually with:
$ etcdctl --endpoints=http://etcd-1:2379 member add etcd-2 --peer-urls http://etcd-2:2380
This is going to disrupt the availability of etcd-1 due to loss of quorum, because etcd-2 is not yet up.
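For reference, the member add command also prints the configuration that the new member should start with. The output looks roughly like this (the IDs will differ in your setup):
Member 97ecdd3706cc2d90 added to cluster 572a811fac16a854

ETCD_NAME="etcd-2"
ETCD_INITIAL_CLUSTER="etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://etcd-2:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"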
Afterwards, we should be able to bring up the second instance by re-running the command that failed previously. And we should be able to query from etcd-2 as well:
$ etcdctl --endpoints=http://etcd-2:2379 get foo
foo
bar
Add another instance
The steps should now be straightforward:
$ etcdctl --endpoints=http://etcd-1:2379 member add etcd-3 --peer-urls http://etcd-3:2380
$ etcd --name=etcd-3 \
--listen-peer-urls http://127.0.0.3:2380 \
--listen-client-urls http://127.0.0.3:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-3:2380 \
--advertise-client-urls http://etcd-3:2379 \
--initial-cluster-state existing
$ etcdctl --endpoints=http://etcd-3:2379 get foo
foo
bar
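At this point, a member list should show all three members in the started state (the IDs below are illustrative):
$ etcdctl --endpoints=http://etcd-1:2379 member list
a3959071884acd0c, started, etcd-1, http://etcd-1:2380, http://etcd-1:2379, false
97ecdd3706cc2d90, started, etcd-2, http://etcd-2:2380, http://etcd-2:2379, false
8c5f93a1b2d4e6f0, started, etcd-3, http://etcd-3:2380, http://etcd-3:2379, false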
Avoid adding members manually
The steps demonstrated above are tedious because we have to add each member manually before we can start its instance. If we know all the members beforehand, we can avoid doing that. To build the same cluster, we can just do the following:
You may need to clean up everything we did above by running rm -drf etcd-*.
$ etcd --name=etcd-1 \
--listen-peer-urls http://127.0.0.1:2380 \
--listen-client-urls http://127.0.0.1:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-1:2380 \
--advertise-client-urls http://etcd-1:2379
$ etcd --name=etcd-2 \
--listen-peer-urls http://127.0.0.2:2380 \
--listen-client-urls http://127.0.0.2:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-2:2380 \
--advertise-client-urls http://etcd-2:2379
$ etcd --name=etcd-3 \
--listen-peer-urls http://127.0.0.3:2380 \
--listen-client-urls http://127.0.0.3:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-3:2380 \
--advertise-client-urls http://etcd-3:2379
Here are the differences:
- All the instances are started with the full member list in --initial-cluster.
- They all start with --initial-cluster-state new (the default value).
With this approach, the first instance won’t be able to take traffic right away because the cluster doesn’t have quorum yet. The cluster only starts serving traffic once the second instance is also up.
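Once the cluster has quorum, we can verify that every endpoint is serving with a health check (the timings will vary):
$ etcdctl --endpoints=http://etcd-1:2379,http://etcd-2:2379,http://etcd-3:2379 endpoint health
http://etcd-1:2379 is healthy: successfully committed proposal: took = 1.837593ms
http://etcd-2:2379 is healthy: successfully committed proposal: took = 2.102471ms
http://etcd-3:2379 is healthy: successfully committed proposal: took = 1.954326ms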
But wait, why does --initial-cluster-state new work here? In our previous attempt, when we joined the second instance without setting it to existing, it complained about a cluster ID mismatch. It turns out that the cluster ID is actually generated based on --initial-cluster. This means that when instances share the same value of --initial-cluster, they’ll come up with the same member IDs as well as the same cluster ID. We can examine the cluster ID by running the command below multiple times, and we can expect it to always be the same:
$ etcdctl --endpoints=http://etcd-1:2379 member list -w json | jq '.header.cluster_id'
1176270058202747400
You should actually see the same value if you followed the steps above.
For more information, you may look at computeMemberId and genID.
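Since the cluster ID is derived deterministically, every endpoint should also report the same value. A quick loop over the three endpoints (assuming a POSIX shell and jq) demonstrates this:
$ for ep in etcd-1 etcd-2 etcd-3; do etcdctl --endpoints=http://$ep:2379 member list -w json | jq '.header.cluster_id'; done
1176270058202747400
1176270058202747400
1176270058202747400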
Failures
There are different failure scenarios:
- Minority failure with data.
- Minority failure without data.
- Majority failure with data.
- Majority failure without data.
Minority failure with data
Let’s imagine a case when etcd-1
(127.0.0.1
) failed, but its local storage
is still available because it uses remote storage. We can simply provision
another instance and let etcd-1
resolves to the new IP address. In my
experiment, let’s assume the new IP address is 127.0.0.4
. We can make this
change by updating /etc/hosts
:
$ cat /etc/hosts
# Static table lookup for hostnames.
# See hosts(5) for details.
127.0.0.4 etcd-1
127.0.0.2 etcd-2
127.0.0.3 etcd-3
Note that etcd-1 now points to 127.0.0.4. We can now start this instance with:
$ etcd --name=etcd-1 \
--listen-peer-urls http://127.0.0.4:2380 \
--listen-client-urls http://127.0.0.4:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-1:2380 \
--advertise-client-urls http://etcd-1:2379
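If everything went well, etcd-1 rejoins the cluster at its new address and reports healthy again (the timing will vary):
$ etcdctl --endpoints=http://etcd-1:2379 endpoint health
http://etcd-1:2379 is healthy: successfully committed proposal: took = 2.031552ms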
Minority failure without data
In this case, not only is the instance gone, but its data is also lost. To simulate this, let’s wipe out the data:
$ rm -drf etcd-1.etcd
Now if we try the same command as mentioned above:
$ etcd --name=etcd-1 \
--listen-peer-urls http://127.0.0.4:2380 \
--listen-client-urls http://127.0.0.4:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-1:2380 \
--advertise-client-urls http://etcd-1:2379
It is going to fail with:
{
"level": "fatal",
"ts": "2023-06-18T13:16:48.878283-0700",
"caller": "etcdmain/etcd.go:204",
"msg": "discovery failed",
"error": "member a3959071884acd0c has already been bootstrapped",
"stacktrace": "go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"
}
What this message means is that etcd-1 has already been bootstrapped in the past, so trying to bootstrap it again won’t succeed. In order to bootstrap it successfully, we have to reset its state by removing this member:
$ memberid=$(etcdctl --endpoints=http://etcd-2:2379 member list --hex -w json | jq -r '.members[] | select(.name == "etcd-1") | .ID')
$ etcdctl --endpoints=http://etcd-2:2379 member remove $memberid
Now we are back to the original situation, where we need to add this member manually and then start the instance with --initial-cluster-state existing:
$ etcdctl --endpoints=http://etcd-2:2379 member add etcd-1 --peer-urls http://etcd-1:2380
$ etcd --name=etcd-1 \
--listen-peer-urls http://127.0.0.4:2380 \
--listen-client-urls http://127.0.0.4:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-1:2380 \
--advertise-client-urls http://etcd-1:2379 \
--initial-cluster-state existing
Majority failure with data
Let’s continue the experimentation, and assume we lost etcd-2
and etcd-3
. In
this case, since we still have the data for etcd-2
and etcd-3
, we can simply
launch two new instances. In this experiment, let’s assume the new IP addresses
for them are 127.0.0.5
and 127.0.0.6
respectively:
$ cat /etc/hosts
# Static table lookup for hostnames.
# See hosts(5) for details.
127.0.0.4 etcd-1
127.0.0.5 etcd-2
127.0.0.6 etcd-3
$ etcd --name=etcd-2 \
--listen-peer-urls http://127.0.0.5:2380 \
--listen-client-urls http://127.0.0.5:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-2:2380 \
--advertise-client-urls http://etcd-2:2379
$ etcd --name=etcd-3 \
--listen-peer-urls http://127.0.0.6:2380 \
--listen-client-urls http://127.0.0.6:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-3:2380 \
--advertise-client-urls http://etcd-3:2379
Majority failure without data
It is clear now that we need to remove the dead members when we don’t have their previous data. However, we won’t be able to perform such operations here, because we lost too many instances and, at that point, no etcdctl operations are possible anymore. This can be further divided into three cases:
- There is a snapshot saved.
- There is no snapshot, but there is still at least one live instance.
- There is no snapshot and there are no live instances.
With snapshot
Let’s save a snapshot before we simulate failures:
$ etcdctl --endpoints=http://etcd-3:2379 snapshot save snapshot.db
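Optionally, we can inspect the snapshot before relying on it; it reports the hash, revision, total key count, and size. Depending on your etcd version, this command lives in etcdutl (it is deprecated under etcdctl):
$ etcdutl snapshot status snapshot.db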
Now let’s stop all instances and also clean the data directories:
$ rm -drf etcd-1.etcd
$ rm -drf etcd-2.etcd
$ rm -drf etcd-3.etcd
Now we can restore the data directory for each of the instances via:
$ etcdctl snapshot restore snapshot.db \
--name etcd-1 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-1:2380
$ etcdctl snapshot restore snapshot.db \
--name etcd-2 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-2:2380
$ etcdctl snapshot restore snapshot.db \
--name etcd-3 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-3:2380
Then we can follow the same steps as above, just as if we had the data available all along:
$ etcd --name=etcd-1 \
--listen-peer-urls http://127.0.0.4:2380 \
--listen-client-urls http://127.0.0.4:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-1:2380 \
--advertise-client-urls http://etcd-1:2379
$ etcd --name=etcd-2 \
--listen-peer-urls http://127.0.0.5:2380 \
--listen-client-urls http://127.0.0.5:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-2:2380 \
--advertise-client-urls http://etcd-2:2379
$ etcd --name=etcd-3 \
--listen-peer-urls http://127.0.0.6:2380 \
--listen-client-urls http://127.0.0.6:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-3:2380 \
--advertise-client-urls http://etcd-3:2379
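The restored cluster should serve the old data:
$ etcdctl --endpoints=http://etcd-2:2379 get foo
foo
bar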
With live instance
If we don’t have snapshots, but we do still have at least one live instance, we
can still restore this cluster by initializing the state from one of the live
instance. In this experiment, I’m going to stop etcd-2
and etcd-3
and also
remove its data directories:
$ rm -drf etcd-2.etcd
$ rm -drf etcd-3.etcd
Now we can restart etcd-1 with --force-new-cluster:
$ etcd --name=etcd-1 \
--listen-peer-urls http://127.0.0.4:2380 \
--listen-client-urls http://127.0.0.4:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-1:2380 \
--advertise-client-urls http://etcd-1:2379 \
--force-new-cluster
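--force-new-cluster rewrites the membership so that etcd-1 becomes the only member while keeping the keyspace intact, which we can confirm with a member list (the ID is illustrative):
$ etcdctl --endpoints=http://etcd-1:2379 member list
a3959071884acd0c, started, etcd-1, http://etcd-1:2380, http://etcd-1:2379, false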
With this step completed, we can bring back etcd-2 and etcd-3 just like we did earlier when we set up the cluster manually:
$ etcdctl --endpoints=http://etcd-1:2379 member add etcd-2 --peer-urls http://etcd-2:2380
$ etcd --name=etcd-2 \
--listen-peer-urls http://127.0.0.5:2380 \
--listen-client-urls http://127.0.0.5:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380 \
--initial-advertise-peer-urls http://etcd-2:2380 \
--advertise-client-urls http://etcd-2:2379 \
--initial-cluster-state existing
$ etcdctl --endpoints=http://etcd-1:2379 member add etcd-3 --peer-urls http://etcd-3:2380
$ etcd --name=etcd-3 \
--listen-peer-urls http://127.0.0.6:2380 \
--listen-client-urls http://127.0.0.6:2379 \
--initial-cluster etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380 \
--initial-advertise-peer-urls http://etcd-3:2380 \
--advertise-client-urls http://etcd-3:2379 \
--initial-cluster-state existing
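Since etcd-1 carried its data through --force-new-cluster, the re-added members replicate it, and the old key is readable from any endpoint:
$ etcdctl --endpoints=http://etcd-3:2379 get foo
foo
bar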
Without snapshot or live instance
If there is no snapshot and there are no live instances, then I think there is nothing we can do to recover the data.
Summary
I hope you enjoyed these experiments. This post walked through a manual clustering setup as well as several disaster recovery scenarios. I hope it offered a few insights and proves helpful when you actually need it.