DO NOT AUTOSCALE : PBCTF 4.0 RCA

The Incident

On 02.08.25, during PBCTF 4.0 - our college club Point Blank's flagship CTF competition - we experienced what every CTF organizer fears: a complete platform outage. For 35 agonizing minutes, from 10:00 AM to 10:35 AM IST, Govind and I tried to bring the service back up while 45-50 of our other teammates made memes and helped calm the revolting participants.

Strictly speaking, it was not a total outage: we had a backup CTFd deployment running on our own PB Server, reachable under a different DNS name, and we directed traffic to it. Sadly, the PB Server was not powerful enough to handle all 400 participants, but it did take care of approximately half of them while the rest of the PB members distracted the other half.

Credits to our senior Ashutosh Pandey for the idea to keep a backup deployment ready.

Introduction

I am Akash Singh, a final year engineering student and Open Source Contributor from Bangalore.
Here is my LinkedIn, GitHub and Twitter

I go by the name SkySingh04 online.

Timeline of Events

T-0: The Calm Before the Storm

The opening ceremony went great and we were ready to get the CTF started. Both deployments were healthy, and participants were logging on and getting ready for the CTF.

Our GCP Kubernetes cluster was a sweet deal: we spent 1000 INR to get 25000 INR in GCP credits and felt invincible with all that cloud power at our disposal.

T+5 minutes: "Let's Scale!"

Traffic was picking up. In a moment of what seemed like DevOps big brain time (it wasn't), I decided to scale up our pods. After all, who doesn't love autoscaling? 25000 INR needs to be used, after all.

# What could possibly go wrong?
spec:
  replicas: 2  # YOLO
[Meme: Me chilling after "AutoSCAllINg"]

T+10 minutes: The First Signs of Trouble

New pods started spinning up. They attempted to run database migrations - standard procedure, right? Wrong. The pods immediately crashed with migration errors.

T+15 minutes: Panic Mode Activated

Our initial hypothesis: "Must be a migration issue!"

Govind said he had run some custom migrations the night before. Maybe something was incompatible? With my big brain, I decided to delete all pods and start fresh.

kubectl delete pods --all -n ctfd
# Famous last words: "This will fix it"

T+16 minutes: Full Outage

Congratulations, we played ourselves. All pods gone. CTFd was completely down on the K8s deployment, and then all of PB started calling at the same time.

The Backup That Saved Us

Here's where we owe a massive thank you to Ashutosh bhaiya for his prescient advice: "Always keep a backup deployment ready."

We had a secondary CTFd instance running on our PB server - completely separate from the Kubernetes cluster but connected to the same database. While Govind and I battled with the K8s deployment, this backup instance kept serving traffic to some participants.


The Real Culprit

After 30 minutes of debugging, checking migration scripts and nearly crying to Claude for a fix, I saw that one emergency pod, created after some random Claude-suggested commands, had finally become healthy.

After everything settled down and we sat down for the RCA, we found out the real issue.

"We were using two different versions of CTFd from two different servers"
  • Lavi's deployment : On PB Server: Latest CTFd version
  • Govind's deployment : On GCP K8S: CTFd 3.7.2

Both connecting to the same database. Both trying to run different migration schemas. Both convinced they were right.
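A guard like the one below could have caught the mismatch before any pod touched the shared database. This is only a minimal sketch, and it assumes the migrations are tracked by Alembic, which records the current schema revision in an `alembic_version` table. The revision ids (`rev_newer`, `rev_older`) and the function names are hypothetical, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3


def read_db_revision(conn):
    """Return the revision recorded in the database, or None if there is none."""
    try:
        row = conn.execute("SELECT version_num FROM alembic_version").fetchone()
    except sqlite3.OperationalError:
        return None  # table missing: treat as a fresh database
    return row[0] if row else None


def safe_to_start(conn, expected_revision):
    """Allow startup only on a fresh database or an exact revision match."""
    current = read_db_revision(conn)
    return current is None or current == expected_revision


# In-memory stand-in for the shared database (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alembic_version (version_num VARCHAR(32) NOT NULL)")
conn.execute("INSERT INTO alembic_version VALUES ('rev_newer')")  # hypothetical id

same_version_ok = safe_to_start(conn, "rev_newer")   # instance on the matching version
other_version_ok = safe_to_start(conn, "rev_older")  # instance on a different version
print(same_version_ok, other_version_ok)
```

Run as a pre-start check, an instance built from a different version would refuse to boot instead of attempting its own migrations against a schema it does not recognise.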

And in honour of this moment, this meme was born: https://x.com/SkySingh04/status/1951981237419258332


What We Should Have Done

1. Proper Load Balancing Architecture

Instead of our ad-hoc "some users go here, some go there" approach, we should have implemented:

# Ideal setup with external load balancer
External DNS (Cloudflare/Route53)
         |
    Load Balancer
    /           \
K8s Cluster    PB Server
    \           /
   Shared Database

This would have:

  • Automatically distributed traffic between both deployments
  • Provided seamless failover when K8s went down
  • Prevented the full outage scenario
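The failover behaviour in the list above can be sketched in a few lines. This is a toy model of what a real load balancer does with health checks, not a replacement for one; the class name and the backend names `k8s-cluster` and `pb-server` are made up for illustration.

```python
from itertools import cycle


class FailoverBalancer:
    """Round-robin across backends, skipping any marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark(self, backend, is_healthy):
        """Record the result of a health check for one backend."""
        if is_healthy:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def pick(self):
        """Return the next healthy backend, or None if every backend is down."""
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        return None


lb = FailoverBalancer(["k8s-cluster", "pb-server"])  # hypothetical backend names
before = [lb.pick() for _ in range(4)]  # traffic alternates between both
lb.mark("k8s-cluster", False)           # simulate the K8s deployment going down
after = [lb.pick() for _ in range(4)]   # everything now lands on the backup
print(before, after)
```

Two caveats from our own incident apply: failover only helps if both backends run compatible CTFd versions against the shared database, and the surviving backend still needs enough capacity, which our PB Server did not have for all 400 participants.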

2. Test Your Autoscaling (Or Don't Use It)

Let's be honest - we enabled autoscaling because it sounded cool. Did we need it? Probably not. Did we test it? Definitely not.

Lessons learned:

  • Autoscaling is not a silver bullet
  • Test scaling behaviors in staging first
  • Sometimes, vertical scaling > horizontal scaling for stateful apps
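For the "test in staging first" point, even a tiny load-test harness would have been better than nothing. The sketch below fires a batch of concurrent calls and reports the error rate and an approximate 95th-percentile latency. `run_load_test` and `fake_request` are hypothetical names; in practice `fake_request` would be replaced by a real HTTP call to a staging URL while the deployment is being scaled.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(send_request, total_requests, concurrency):
    """Call send_request() total_requests times and summarize the outcome."""

    def timed_call(_):
        start = time.perf_counter()
        try:
            send_request()
            ok = True
        except Exception:
            ok = False  # any exception counts as a failed request
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    failures = sum(1 for ok, _ in results if not ok)
    return {
        "requests": total_requests,
        "failures": failures,
        "error_rate": failures / total_requests,
        # Approximate p95: index into the sorted latencies.
        "p95_seconds": latencies[int(0.95 * (len(latencies) - 1))],
    }


def fake_request():
    """Stand-in for a real request against staging (assumption for the demo)."""
    time.sleep(0.001)


report = run_load_test(fake_request, total_requests=200, concurrency=20)
print(report)
```

Running something like this in staging while changing the replica count would show whether new pods come up cleanly or crash on startup, before participants find out for you.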

Post-Mortem Action Items

  • [ ] Apologise to Cyber Team
  • [ ] Gaslight the participants that it wasn't a server issue, it was their internet that was flaky
  • [ ] Remind myself to always use the same pinned version everywhere
  • [ ] Beat up Govind for secretly running a custom migration script last night and not telling me about it until the autoscaled pods crashed.

Past the memes

PBCTF 4.0 taught us that running a CTF competition is as much about infrastructure resilience as it is about challenge quality. While our participants experienced 35 minutes of downtime, I think it was their internet at fault after all.

To all future CTF organizers: Learn from our mistakes. Pin your versions. Test your scaling. And maybe, just maybe, don't autoscale unless you really need to.