System Goes Down Somewhere Between Guilin and Guangzhou

It’s been a while since I last wrote here.
Not because nothing was happening. Quite the opposite.

The past few months were mostly spent working on Ordina and pushing fairly deep changes into the system. We shipped what internally became version 1.5 — a restructuring that touched almost every codepath in the platform.

We moved away from the older item structure into a more flexible category-based model, added support for variants, reworked parts of the POS flow, and somehow still kept backward compatibility with older app versions.

All of this happened during Ramadan. I’ll probably write about that separately.

But after weeks of long nights, production anxiety, and migration-related edge cases, things finally started to feel stable again.

Or stable enough.

I’m currently in China for family-related business.

On the afternoon of April 14, somewhere on the highway between Guilin and Guangzhou, after a surprisingly calm morning in the countryside, I got a call.

It was around 8 AM Libya time. A call at that hour usually means something is wrong.

Not a message. Not “check this when you’re free.”

A call.

Customers suddenly couldn’t get past the app’s initial loading screen. At first, the issue didn’t make much sense.

The backend was healthy. APIs were responding. Nothing looked down.

But if customers can’t get past the loading screen, the system is operationally down, whether your monitoring agrees with you or not.

The situation around me wasn’t ideal either.

My laptop battery was already dead. My phone was in battery saver mode, so I couldn’t get to my inbox. Most Google services barely worked without a VPN. And we were still hours away from Guangzhou.

So the debugging process became slow and awkward.

Trying to find the root cause using whatever I could inspect from my phone somewhere in southern China was painful.

The actual issue turned out to be surprisingly small.

Our image CDN became inaccessible.

The root cause itself was even smaller: a Google Cloud payment did not go through.

Normally this should have been caught almost immediately through email notifications. Except I never saw the emails.

Google services in China already required VPN access. On top of that, I had a long trip ahead with a meeting coming up, so I had put my phone in battery saver mode, and email sync had quietly stopped in the background.

So the alert technically existed. I was just unaware of it.

And unfortunately, parts of the client app loading flow implicitly depended on those images being available.

So even though:

the backend was healthy
the APIs were responding
the ordering logic itself was functioning correctly

the user experience effectively collapsed.

Not because the core system failed. But because a dependency that was supposed to be non-critical quietly became critical over time.

What made this more frustrating is that this was not the first time the image CDN had caused issues. Funnily enough, the first time this exact issue showed up was during Ramadan, while I was praying Taraweeh. Apparently production incidents in Ordina prefer inconvenient timing.

The frontend was already supposed to handle image failures gracefully instead of blocking the user flow entirely.

However, somewhere between implementation details, assumptions, and QA, that behavior slipped through.

And we eventually stumbled on it in production.
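For what it’s worth, the kind of guard that was supposed to exist can be sketched in a few lines. This is not Ordina’s actual code: withFallback, loadStartupAssets, the placeholder URL, and the 2-second timeout are all hypothetical names and values chosen for illustration. The idea is simply that a non-critical load gets a deadline and a fallback, so a dead CDN degrades the screen instead of blocking it.

```typescript
// Hypothetical helper (not from the real codebase): resolve with `fallback`
// if `load` rejects, throws, or takes longer than `timeoutMs`.
async function withFallback<T>(
  load: () => Promise<T>,
  fallback: T,
  timeoutMs: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), timeoutMs);
  });
  try {
    // Whichever settles first wins: the real load or the deadline.
    return await Promise.race([load(), timeout]);
  } catch {
    // A failed image load must never fail the startup flow.
    return fallback;
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Assumed placeholder asset bundled with the app, so it needs no network.
const PLACEHOLDER_BANNER = "asset://placeholder-banner";

interface StartupAssets {
  bannerUrl: string;
}

// Hypothetical startup step: the banner is nice to have, never required.
async function loadStartupAssets(
  fetchBanner: () => Promise<string>,
  timeoutMs = 2000,
): Promise<StartupAssets> {
  const bannerUrl = await withFallback(fetchBanner, PLACEHOLDER_BANNER, timeoutMs);
  return { bannerUrl };
}
```

The design choice worth noting is the deadline: catching errors alone is not enough, because an unreachable host often hangs rather than fails.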

What stood out to me wasn’t the outage itself.

It was how small the real problem was compared to the operational impact.

One failing edge service. One missing fallback. And suddenly a functioning system starts behaving like a dead one.

You can have:

healthy servers
normal database metrics
successful API responses
stable business logic

and still end up with a product people simply cannot use.
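One way to make monitoring agree with what users actually see is to grade status by the startup path rather than by the backend alone. The sketch below is only an illustration under my own assumptions: ProbeResult, blocksStartup, and userFacingStatus are made-up names, and how each probe gets its result is left out entirely.

```typescript
// Hypothetical shape of one dependency check, as seen from the client side.
type ProbeResult = {
  name: string;
  ok: boolean;
  // True if the app cannot get past its loading screen without this dependency.
  blocksStartup: boolean;
};

type Status = "up" | "degraded" | "down";

// Grade the system by what a user experiences, not by what the servers report.
function userFacingStatus(probes: ProbeResult[]): Status {
  if (probes.some((p) => !p.ok && p.blocksStartup)) return "down";
  if (probes.some((p) => !p.ok)) return "degraded";
  return "up";
}

// The incident, roughly: everything healthy except the image CDN,
// which the loading flow had quietly come to depend on.
const incident: ProbeResult[] = [
  { name: "api", ok: true, blocksStartup: true },
  { name: "database", ok: true, blocksStartup: true },
  { name: "image-cdn", ok: false, blocksStartup: true },
];

console.log(userFacingStatus(incident));
```

Had the CDN really been non-critical, its blocksStartup flag would be false and the same failure would grade as degraded instead of down.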

I ended up routing the images through another path and things recovered in roughly half an hour.

Not catastrophic. Not even a particularly large outage. It was early morning, after all.

But I have to say, there’s something strange about debugging production systems while speeding down a highway in China with a dead laptop and unreliable connectivity. The odds of ending up in a situation like that are close to zero.

This incident reinforced something I have always yapped about to my colleagues while designing solutions and writing code.

Murphy’s Law: anything that can go wrong will go wrong.

I have witnessed enough situations to say that real-world conditions will eventually expose every assumption you forgot you made.

Production has a way of finding the weakest point in the room.

Usually at the worst possible moment.