4 min read

6 months of uptime, then SIGILL - how Google quietly swapped our CPUs

cloud-run · debugging · llama-cpp · gcp · ai-infrastructure
HAIT Cloud & DevOps Consulting

We had an AI chat service running on Google Cloud Run. Python, FastAPI, llama.cpp with a Llama 3.2 3B model, ChromaDB for vectors, the usual stack. It ran fine for six months. Nobody touched it.

Then one day, every cold start began crashing with SIGILL (signal 4). Illegal instruction.

No code changes. No dependency updates. Nothing.

What the crash looked like

The logs showed the LLM engine starting its warm-up inference, and then… nothing. No Python traceback, no error message. The process just died. Cloud Run killed it after the startup probe timed out.

The exit signal was SIGILL - the CPU hit an instruction it couldn’t execute. That’s not something you see in normal Python code. It means something went wrong deep in a C/C++ extension.
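You can see how this surfaces to the supervising process in a few lines. This is a generic sketch, not the service's code: Python's subprocess module reports a signal-killed child as a negative return code, so death by SIGILL shows up as -4 on Linux, matching the "signal 4" in the Cloud Run logs.

```python
import signal
import subprocess
import sys

# Spawn a child that raises SIGILL against itself, simulating a crash deep
# inside a C/C++ extension. The parent sees no traceback from the child --
# just a negative return code encoding the fatal signal.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGILL)"]
)

print(child.returncode)                     # -4 on Linux
print(-child.returncode == signal.SIGILL)   # True
```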

Wrong guesses

Corrupted model? No. The model loaded fine every time. The crash happened during the first actual inference call, not during loading.

Out of memory? The service had 8Gi allocated; peak usage was around 4Gi. Not even close.

Dependency changed? The base Docker image was built back in August 2025 and never rebuilt. Same wheel, same versions.

Thread safety? This was the interesting one. The crash happened in a worker thread. But when I ran the exact same inference code in a separate subprocess, it worked. Same container, same model, same code. Different process.

That last bit was the clue.
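The isolation test boils down to a tiny helper. This is a reconstruction for illustration, not the original debugging script; `runs_cleanly()` and the snippet string are stand-ins for the real warm-up inference call.

```python
import subprocess
import sys

def runs_cleanly(snippet: str) -> bool:
    """Run `snippet` in a fresh interpreter and report whether it survived.

    A negative return code means the child was killed by a signal
    (e.g. -4 for SIGILL). A fresh process has nothing else loaded --
    in particular, no PyTorch -- which is what made this test decisive.
    """
    proc = subprocess.run([sys.executable, "-c", snippet])
    return proc.returncode == 0

# Stand-in workload; in the actual session this was the inference code.
print(runs_cleanly("x = sum(range(10))"))  # True
```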

What was actually happening

The difference between the main process and the subprocess was what else was loaded. The main process had PyTorch loaded (for a toxicity detection model). The subprocess didn’t.

Both PyTorch and llama.cpp use OpenMP, and both query CPU features at startup to pick optimized code paths. When I checked the CPUID flags inside the container:

grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u
avx512bw
avx512cd
avx512dq
avx512f
avx512vl
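The same check can be scripted. Here's a small sketch (Linux only, and the parsing helper is mine, not from the debugging session) that pulls the advertised AVX-512 feature bits out of /proc/cpuinfo-style text:

```python
import re

def avx512_flags(cpuinfo_text: str) -> list[str]:
    """Return the distinct AVX-512 feature flags advertised in the text."""
    return sorted(set(re.findall(r"avx512[a-z_]*", cpuinfo_text)))

# Sample "flags" line; on a real Linux host you would pass
# open("/proc/cpuinfo").read() instead.
sample = "flags : fpu avx2 avx512f avx512bw avx512vl avx512f"
print(avx512_flags(sample))  # ['avx512bw', 'avx512f', 'avx512vl']
```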

Intel Sapphire Rapids. Google had swapped the hardware in europe-west2 sometime between September 2025 and February 2026.

Here’s the problem: Sapphire Rapids physically has AVX-512 and correctly reports it in CPUID. But Google Cloud Run disables AVX-512 execution on these CPUs.

So ggml (the compute library inside llama.cpp) reads CPUID, sees AVX-512 flags, picks the AVX-512 code path for matrix math, and then… SIGILL. The CPU says “I can do this” but when you actually try, it throws an illegal instruction.

This is a known issue that hits multiple projects. Any library doing CPUID-based dispatch (catboost, FAISS, ggml, OpenBLAS) can run into this on Cloud Run.
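To make the failure mode concrete, here's a toy model of CPUID-based dispatch - illustrative only, not ggml's actual source. The dispatcher trusts the advertised flags, so when the hypervisor refuses to execute AVX-512 even though CPUID advertises it, the "best" kernel traps at runtime:

```python
def pick_kernel(cpu_flags: set[str]) -> str:
    """Choose the fastest matrix-math kernel the CPU claims to support.

    This mirrors the dispatch pattern: trust the advertised flags.
    On Cloud Run's Sapphire Rapids, "avx512f" is advertised but the
    instructions are disabled, so the chosen path dies with SIGILL.
    """
    if "avx512f" in cpu_flags:
        return "matmul_avx512"   # the path that SIGILLed in production
    if "avx2" in cpu_flags:
        return "matmul_avx2"     # the safe fallback after the fix
    return "matmul_scalar"

print(pick_kernel({"avx2", "avx512f"}))  # matmul_avx512
print(pick_kernel({"avx2"}))             # matmul_avx2
```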

Why it worked before

Before the hardware swap, europe-west2 probably had Cascade Lake or Ice Lake CPUs that either didn’t have AVX-512 at all, or had it actually enabled. Google swapped hardware without any notification. No changelog entry, no email, nothing. The service just started dying.

The fix

Rebuild llama_cpp_python with AVX-512 explicitly disabled:

docker run --platform linux/amd64 \
  -v "$(pwd)/assets:/out" \
  python:3.10-slim bash -c "
    apt-get update && apt-get install -y build-essential cmake gcc g++ &&
    CMAKE_ARGS='-DGGML_NATIVE=OFF -DGGML_AVX512=OFF' \
    pip wheel llama-cpp-python==0.3.5 --no-deps -w /out
  "

Two flags matter here:

  • -DGGML_NATIVE=OFF - don’t optimize for the build machine’s CPU
  • -DGGML_AVX512=OFF - skip AVX-512 code paths even if CPUID says they’re available

Then drop the wheel into the Docker image:

COPY assets/llama_cpp_python-0.3.5-cp310-cp310-linux_x86_64.whl /tmp/
RUN pip install --no-deps --force-reinstall /tmp/llama_cpp_python-*.whl

Now ggml falls back to AVX2, which Sapphire Rapids fully supports. Performance difference for a 3B model is negligible.
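A defensive pattern worth considering after a bug like this (my suggestion, not something the original service did): run the warm-up inference in a throwaway subprocess first, so a SIGILL produces a clear log line instead of a silent startup-probe timeout. `safe_warm_up` and its snippet argument are hypothetical names; the real code would invoke the llama.cpp warm-up call.

```python
import signal
import subprocess
import sys

def safe_warm_up(warm_up_snippet: str) -> None:
    """Probe the warm-up code in a child process before running it in-process.

    If the child dies on SIGILL, raise a descriptive error instead of
    letting the main process vanish without a traceback.
    """
    proc = subprocess.run([sys.executable, "-c", warm_up_snippet])
    if proc.returncode == -signal.SIGILL:
        raise RuntimeError(
            "warm-up crashed with SIGILL: the native wheel likely uses "
            "CPU instructions this host refuses to execute"
        )

safe_warm_up("x = 1 + 1")  # harmless stand-in for the real inference call
print("warm-up ok")
```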

Takeaways

If you’re running native code (llama.cpp, PyTorch, numpy with MKL, whatever) on serverless, always compile with -DGGML_NATIVE=OFF or equivalent. Don’t let the binary auto-detect CPU features at build time. The machine you build on and the machine you run on are not the same, and the cloud provider can swap hardware whenever they want.

SIGILL with no Python traceback almost always means the crash happened in native code - a C/C++ extension, not your Python. Python never gets a chance to catch it.

And if your service suddenly breaks after months of stability with no code changes - check if the underlying hardware changed. On serverless, you don’t control that. And nobody will tell you when it happens.


Need help debugging cloud-native AI deployments? Get in touch.