# 🔐 Vault Disaster Recovery in EKS: How to Handle It
If you're running Vault in Kubernetes (EKS) using Raft as the storage backend, one of the most stressful moments is:
> ❗ “What happens if a Vault Pod and its PVC are deleted?”
In this post, I’ll walk you through a real-world disaster recovery scenario — where a Vault Pod and its persistent volume go missing — and how you can **quickly and safely recover** your Raft-based Vault cluster.
---
## 📆 Environment Setup (Assumptions)
- HashiCorp Vault 1.19+
- Deployed on EKS using Helm
- Raft as the storage backend (each Vault Pod uses EBS-backed PVC)
- 3 Vault Pods: `vault-0`, `vault-1`, `vault-2`
- NodeGroup located in a single AZ (e.g., `us-west-1a`)
---
## 💥 Failure Scenario: Deleting vault-0 Pod + PVC
```bash
kubectl delete pod vault-0 -n vault
kubectl delete pvc data-vault-0 -n vault
```
After deletion, the StatefulSet recreates the Pod with a fresh, empty volume. Because the Raft data is gone, Vault on `vault-0` starts in the `Initialized: false` state, and an uninitialized node cannot be unsealed.
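
To confirm you are really in this state before changing anything, a check along these lines can help. It is a minimal sketch: `status_field` is a hypothetical helper (not part of Vault) that extracts one boolean from `vault status -format=json`, and the pod, PVC, and namespace names are the ones assumed in this post.

```bash
# Hypothetical helper: read `vault status -format=json` on stdin and print
# the value (true/false) of one boolean field, e.g. initialized or sealed.
status_field() {
  grep -o "\"$1\": *[a-z]*" | head -n 1 | sed 's/.*: *//'
}

# Usage (not run here). `vault status` exits non-zero while the node is
# sealed, so only the parsed output is inspected:
#   kubectl exec vault-0 -n vault -- vault status -format=json | status_field initialized
#   kubectl get pvc data-vault-0 -n vault
```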
---
## ⚠️ Common Mistake: Reinitializing Vault
```bash
vault operator init
```
> ❌ **Do NOT run this command!**
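This initializes `vault-0` as a brand-new, single-node Vault cluster with its own unseal keys and root token. You end up with two unrelated clusters (a **split-brain**-style situation), and `vault-0` can no longer join your existing Raft cluster until its data is wiped again.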
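
One way to guard against this mistake is to check whether any other pod already belongs to an initialized cluster before `init` is even considered. The sketch below is an assumption-laden illustration, not an official procedure: `cluster_exists` is a hypothetical helper, and it relies on `kubectl exec` access to the other pods.

```bash
# Hypothetical helper: return 0 if any of the given pods reports an
# initialized Vault, 1 otherwise. Usage: cluster_exists <namespace> <pod>...
cluster_exists() {
  local ns="$1"; shift
  local pod
  for pod in "$@"; do
    # `vault status` exits non-zero on a sealed node, so only grep's result
    # is used (this assumes `set -o pipefail` is not enabled).
    if kubectl exec "$pod" -n "$ns" -- vault status -format=json 2>/dev/null \
        | grep -q '"initialized": *true'; then
      return 0
    fi
  done
  return 1
}

# Usage (not run here):
#   if cluster_exists vault vault-1 vault-2; then
#     echo "Existing cluster found: join it, do not run 'vault operator init'"
#   fi
```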
---
## ✅ Correct Recovery Procedure
### 1. Join `vault-0` back to the cluster
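Run this inside the recreated `vault-0` Pod. The argument is the API address of the node you are joining *to*, which should be the active node of the existing cluster (`vault-1` in this example):
```bash
vault operator raft join http://vault-1.vault-internal:8200
```
(If TLS is enabled, use `https://`. `-tls-skip-verify` disables certificate checks on the CLI connection, so prefer supplying the CA certificate, for example with `-leader-ca-cert`.)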
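
If you are not sure which node is currently active, the peer list from a healthy pod can tell you. The following is a rough sketch under several assumptions: `leader_host` and `join_to_leader` are hypothetical helpers, the peer list has the usual `Node / Address / State / Voter` columns, the API listens on plain HTTP on port 8200, and the healthy pod has a valid token for `list-peers`.

```bash
# Hypothetical helper: read `vault operator raft list-peers` output on stdin
# and print the host part of the leader's cluster address.
leader_host() {
  awk '$3 == "leader" { split($2, a, ":"); print a[1] }'
}

# Hypothetical helper: join <pod> to the cluster via the leader found in the
# given peer list. Assumes plain HTTP on the default API port 8200.
join_to_leader() {
  local pod="$1" ns="$2" peers="$3"
  local host
  host="$(printf '%s\n' "$peers" | leader_host)"
  [ -n "$host" ] || { echo "no leader found in peer list" >&2; return 1; }
  kubectl exec "$pod" -n "$ns" -- vault operator raft join "http://${host}:8200"
}

# Usage (not run here):
#   peers="$(kubectl exec vault-1 -n vault -- vault operator raft list-peers)"
#   join_to_leader vault-0 vault "$peers"
```

Reading the address column instead of the node name is a deliberate choice here, since node names only match pod names when the node ID is configured that way.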
### 2. Check `vault-0` status
```bash
vault status
```
- It should now say `Initialized: true`
- If `Sealed: true`, proceed to unseal
### 3. Unseal (if Auto-Unseal is not configured)
```bash
vault operator unseal
```
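Run the command once per key share until the unseal threshold is reached (3 keys with the default 5-share, 3-threshold setup). Use the unseal keys of the existing cluster, not new ones.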
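
Because the command has to be repeated once per key, a small loop can save some typing. This is only a sketch: `unseal_pod` is a hypothetical helper, it assumes `kubectl exec` access to the pod, and it deliberately leaves key entry interactive so that no key ends up in shell history.

```bash
# Hypothetical helper: prompt for unseal keys on a pod until it reports
# "sealed": false, giving up after max_keys prompts (default 5).
unseal_pod() {
  local pod="$1" ns="$2" max_keys="${3:-5}"
  local i=0
  while [ "$i" -lt "$max_keys" ]; do
    if kubectl exec "$pod" -n "$ns" -- vault status -format=json 2>/dev/null \
        | grep -q '"sealed": *false'; then
      return 0
    fi
    # Interactive: Vault prompts for one key share and hides the input.
    kubectl exec -it "$pod" -n "$ns" -- vault operator unseal
    i=$((i + 1))
  done
  # Final check after the last key was entered.
  kubectl exec "$pod" -n "$ns" -- vault status -format=json 2>/dev/null \
    | grep -q '"sealed": *false'
}

# Usage (not run here):
#   unseal_pod vault-0 vault
```

The function returns non-zero if the node is still sealed after the last prompt, which usually points to a wrong key or a failed join.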
### 4. Confirm cluster status
```bash
vault operator raft list-peers
```
You should now see `vault-0` listed as a `follower`. ✅
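
If you want to check this from a script rather than by eye, the peer list can be filtered for the rejoined pod. A minimal sketch, assuming the usual `Node / Address / State / Voter` column layout and that the address column starts with the pod's DNS name; `peer_state` is a hypothetical helper.

```bash
# Hypothetical helper: read `vault operator raft list-peers` output on stdin
# and print the State column for the peer whose address starts with "<pod>.".
peer_state() {
  awk -v pod="$1" 'index($2, pod ".") == 1 { print $3 }'
}

# Usage (not run here; needs a valid token on the pod running list-peers):
#   kubectl exec vault-1 -n vault -- vault operator raft list-peers | peer_state vault-0
```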
---
## 🔍 Additional Troubleshooting
### Check if still not initialized
```bash
curl -s http://127.0.0.1:8200/v1/sys/init
```
Returns:
```json
{"initialized": false}
```
→ The join likely failed or Vault hasn't synced yet.
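
Since the node may simply need a little more time, polling the same endpoint for a short while is often more useful than a single call. A minimal sketch, assuming the API is reachable at the address used above (for example through a port-forward); `wait_initialized` is a hypothetical helper and the retry values are arbitrary.

```bash
# Hypothetical helper: poll /v1/sys/init until it reports initialized=true.
# Usage: wait_initialized [addr] [tries] [delay_seconds]
wait_initialized() {
  local addr="${1:-http://127.0.0.1:8200}" tries="${2:-10}" delay="${3:-3}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -s "$addr/v1/sys/init" | grep -q '"initialized": *true'; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (not run here):
#   wait_initialized http://127.0.0.1:8200 10 3 || echo "still uninitialized, check the join"
```

If this still times out, the join most likely never succeeded and the logs are the next place to look.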
### View logs for `vault-0`
```bash
kubectl logs vault-0 -n vault
```
Look for:
```
successfully joined raft cluster
```
---
## 🛡️ Best Practices for Future Resilience
| Task | Why It Matters |
|------------------------------|------------------------------------------------------------------|
| ✅ Use Auto-Unseal | AWS KMS or HSM removes need for manual unseal steps |
| ✅ Periodic Raft Snapshots   | Backups from `vault operator raft snapshot save` can be restored with `vault operator raft snapshot restore` |
| ✅ PodDisruptionBudget       | Limits voluntary evictions (e.g., node drains) so quorum is kept |
| ✅ volumeBindingMode tuning | Use `WaitForFirstConsumer` to control EBS zone binding |
| ✅ Spread across AZs | For HA, use 3 AZs with PodAntiAffinity if possible |
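
For the snapshot row in particular, a timestamped backup can be scripted in a few lines. This is a sketch rather than a complete backup solution: `save_snapshot` is a hypothetical helper, and it assumes the `vault` CLI runs somewhere with `VAULT_ADDR` pointing at the cluster and a `VAULT_TOKEN` that is allowed to take snapshots.

```bash
# Hypothetical helper: save a Raft snapshot to <dest_dir> under a UTC
# timestamped name and print the resulting path on success.
save_snapshot() {
  local dest_dir="${1:-.}"
  local file
  file="$dest_dir/vault-raft-$(date -u +%Y%m%dT%H%M%SZ).snap"
  vault operator raft snapshot save "$file" && echo "$file"
}

# Usage (not run here), e.g. from a scheduled job:
#   save_snapshot /backups
```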
---
## 🚀 TL;DR – Recovery Steps
1. Vault Pod (`vault-0`) and its PVC are deleted: do **not** run `vault operator init`
2. On `vault-0`, rejoin the Raft cluster with `vault operator raft join <active-node-address>`
3. Unseal the node if needed
4. Verify it's back in the cluster with `vault operator raft list-peers`
---
## 🧐 Final Thoughts
Vault is powerful but must be handled carefully when running in Raft mode with EBS volumes.
Losing a PVC doesn't mean you’ve lost your Vault — but only **if you know how to recover it**.
Practice this flow **before** it happens in production.
Recovery is straightforward once you've rehearsed it.
---
> “The best incident is the one you've already practiced.”
> — A true DevOps mindset