 # 🔐 Vault Disaster Recovery in EKS: How to Handle It


If you're running Vault in Kubernetes (EKS) using Raft as the storage backend, one of the most stressful moments is:


> ❗ “What happens if a Vault Pod and its PVC are deleted?”


In this post, I’ll walk you through a real-world disaster recovery scenario — where a Vault Pod and its persistent volume go missing — and how you can **quickly and safely recover** your Raft-based Vault cluster.


---


## 📆 Environment Setup (Assumptions)


- HashiCorp Vault 1.19+

- Deployed on EKS using Helm

- Raft as the storage backend (each Vault Pod uses EBS-backed PVC)

- 3 Vault Pods: `vault-0`, `vault-1`, `vault-2`

- NodeGroup located in a single AZ (e.g., `us-west-1a`)


---


## 💥 Failure Scenario: Deleting vault-0 Pod + PVC


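```bash
# Delete the PVC first; it stays in Terminating while the Pod still uses it
kubectl delete pvc data-vault-0 -n vault --wait=false
# Deleting the Pod releases the PVC; the StatefulSet then recreates both
kubectl delete pod vault-0 -n vault
```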


After deletion, the StatefulSet recreates the Pod with a fresh, empty volume. Because the original Raft data is gone, `vault-0` starts in the `Initialized: false` state and cannot be unsealed until it rejoins the cluster.
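
Before touching anything, it can help to confirm that only `vault-0` lost its state. The following is a minimal sketch that compares the `initialized` flag across the three Pods, assuming the Pod names and the `vault` namespace used in this post. `vault_initialized` and `check_all_pods` are hypothetical helper names, not part of the Vault CLI.

```bash
# Extract the "initialized" flag from the JSON printed by `vault status -format=json`.
# Reads JSON on stdin and prints "true" or "false" (nothing if the field is missing).
vault_initialized() {
  grep -o '"initialized"[[:space:]]*:[[:space:]]*[a-z]*' | head -n 1 | sed 's/.*:[[:space:]]*//'
}

# Print the flag for every Pod (names and namespace are the ones assumed in this post).
check_all_pods() {
  for pod in vault-0 vault-1 vault-2; do
    # `vault status` exits non-zero on sealed or uninitialized nodes, so only the output is used.
    state=$(kubectl exec -n vault "$pod" -- vault status -format=json 2>/dev/null | vault_initialized)
    echo "$pod initialized=${state:-unknown}"
  done
}
```

Calling `check_all_pods` prints one line per Pod. In the scenario described here you would expect only `vault-0` to report `false`.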


---


## ⚠️ Common Mistake: Reinitializing Vault


```bash

vault operator init

```


> ❌ **Do NOT run this command!**


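Running `vault operator init` on the empty node creates a brand-new, single-node Vault cluster with its own unseal keys and root token. The result is a **split-brain**: `vault-0` no longer belongs to your existing Raft cluster, and the two clusters hold different data.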


---


## ✅ Correct Recovery Procedure


### 1. Join `vault-0` back to the cluster


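Run this inside the new `vault-0` Pod (for example via `kubectl exec -it vault-0 -n vault -- sh`), pointing it at an existing, unsealed member of the cluster (ideally the current leader, e.g., `vault-1`):

```bash
vault operator raft join http://vault-1.vault-internal:8200
```

(If TLS is enabled, use `https://`; add `-tls-skip-verify` only when the certificate cannot be verified, such as in a test cluster.)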
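
`vault operator raft join` is executed on the node that is joining and takes the API address of a node that is already in the cluster. As a rough sketch, the call can be wrapped so the direction is hard to get wrong. It assumes the Helm chart's headless service is named `vault-internal` and TLS is disabled; `vault_api_addr` and `rejoin_node` are hypothetical helper names.

```bash
# Build the in-cluster API address of a Vault Pod.
# Assumes the headless service `vault-internal` and plain HTTP on port 8200.
vault_api_addr() {
  echo "http://$1.vault-internal:8200"
}

# Join a freshly recreated Pod ($1) to the cluster through an existing member ($2).
rejoin_node() {
  new_pod=$1
  existing_pod=$2
  kubectl exec -n vault "$new_pod" -- vault operator raft join "$(vault_api_addr "$existing_pod")"
}

# Example: rejoin_node vault-0 vault-1
```

The first argument is always the empty node and the second is a healthy member, which mirrors how the CLI itself expects to be used.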


### 2. Check `vault-0` status


```bash

vault status

```


- It should now say `Initialized: true`

- If `Sealed: true`, proceed to unseal


### 3. Unseal (if Auto-Unseal is not configured)


```bash

vault operator unseal

```


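Run the command once per key, entering one of your original unseal keys each time, until the threshold is met (3 keys with the default setup of 5 shares and a threshold of 3).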
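
Since each invocation accepts a single key, the unseal step is really a small loop. Here is a minimal sketch, assuming a threshold of 3 and the `vault` namespace from this post; `unseal_pod` is a hypothetical helper name.

```bash
# Prompt for unseal keys until the threshold is reached.
# $1 = Pod name, $2 = key threshold (defaults to 3; adjust to match your cluster).
unseal_pod() {
  pod=$1
  threshold=${2:-3}
  i=1
  while [ "$i" -le "$threshold" ]; do
    echo "Unseal key $i of $threshold for $pod:"
    # -it attaches a TTY so the key is typed at Vault's prompt instead of being passed as an argument.
    kubectl exec -it -n vault "$pod" -- vault operator unseal
    i=$((i + 1))
  done
}

# Example: unseal_pod vault-0
```

Typing the keys at the prompt keeps them out of your shell history, which is the main reason the sketch does not take them as arguments.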


### 4. Confirm cluster status


```bash

vault operator raft list-peers

```


You should now see `vault-0` listed as a `follower`. ✅
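
If you want to check this from a script rather than by eye, the peer table can be filtered with `awk`. The sketch below assumes the usual four-column layout (Node, Address, State, Voter) and that the Pod hostname appears at the start of the Address column; `peer_state` is a hypothetical helper name.

```bash
# Print the State column for one node from `vault operator raft list-peers` output (read on stdin).
# Matches either the Node column or the hostname at the start of the Address column,
# because node IDs equal Pod names only if the deployment configures them that way.
peer_state() {
  awk -v n="$1" '$1 == n || index($2, n ".") == 1 { print $3 }'
}

# Example (requires a valid Vault token inside the Pod):
#   kubectl exec -n vault vault-1 -- vault operator raft list-peers | peer_state vault-0
```

If the helper prints more than one line for the same Pod, a stale peer entry from the old volume may still be registered and is worth investigating.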


---


## 🔍 Additional Troubleshooting


### Check if still not initialized


```bash

curl -s http://127.0.0.1:8200/v1/sys/init

```


Returns:


```json

{"initialized": false}

```


→ The join likely failed or Vault hasn't synced yet.


### View logs for `vault-0`


```bash

kubectl logs vault-0 -n vault

```


Look for:


```

successfully joined raft cluster

```


---


## 🛡️ Best Practices for Future Resilience


| Task                         | Why It Matters                                                                                           |
|------------------------------|----------------------------------------------------------------------------------------------------------|
| ✅ Use Auto-Unseal            | AWS KMS or an HSM removes the need for manual unseal steps                                               |
| ✅ Periodic Raft Snapshots    | Take them with `vault operator raft snapshot save`; restore with `vault operator raft snapshot restore`  |
| ✅ PodDisruptionBudget        | Limits voluntary evictions (e.g., node drains) so the cluster keeps quorum                               |
| ✅ volumeBindingMode tuning   | Use `WaitForFirstConsumer` so each EBS volume is created in the AZ where its Pod is scheduled            |
| ✅ Spread across AZs          | For HA, use 3 AZs with PodAntiAffinity if possible                                                       |
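
For the snapshot row, a periodic job can be as small as the sketch below. It assumes `VAULT_ADDR` and `VAULT_TOKEN` are already exported for the live cluster; the file naming scheme and the helper names `snapshot_name` and `save_snapshot` are illustrative choices, not something Vault prescribes.

```bash
# Build a timestamped file name so successive snapshots do not overwrite each other.
snapshot_name() {
  echo "vault-raft-$(date -u +%Y%m%dT%H%M%SZ).snap"
}

# Save a snapshot into directory $1 (defaults to the current directory).
# Assumes VAULT_ADDR and VAULT_TOKEN point at the running cluster.
save_snapshot() {
  dir=${1:-.}
  vault operator raft snapshot save "$dir/$(snapshot_name)"
}

# Restoring later would use: vault operator raft snapshot restore <file>
```

Run `save_snapshot /some/backup/dir` from cron or a Kubernetes CronJob, and copy the resulting files off the cluster so they survive the loss of a volume.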


---


## 🚀 TL;DR – Recovery Steps


1. Vault Pod (`vault-0`) and its PVC are deleted
2. From `vault-0`, rejoin the Raft cluster using `vault operator raft join`
3. Unseal the node if needed
4. Verify it's in sync with `vault operator raft list-peers`


---


## 🧐 Final Thoughts


Vault is powerful but must be handled carefully when running in Raft mode with EBS volumes.  

Losing a PVC doesn't mean you’ve lost your Vault — but only **if you know how to recover it**.


Practice this flow **before** it happens in production.  

Recovery is easy once you've done it once.


---


> “The best incident is the one you've already practiced.”  

> — A true DevOps mindset
