Operator fusion: hand-writing a kernel the compiler refused to

When the graph optimizer leaves performance on the table, and how to claw it back without a full custom backend.

Graph compilers are very good and occasionally stubborn. This note walks through one such case: what the profile showed, why the optimizer skipped a fusion it could have made, and the hand-written kernel that recovered it.

The symptom

Profiling a small edge inference graph showed a chain of elementwise ops between two heavier kernels. Each op wrote its activations out to memory, and the next op read them back. On a bandwidth-bound target, that memory traffic, not the math, was the cost.
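A back-of-the-envelope model makes the cost concrete. The sketch below assumes unary elementwise ops and no cache reuse between them, so every unfused op pays one full read and one full write. The function name, tensor size, and op count are illustrative and are not measurements from this graph.

```python
def chain_traffic_bytes(num_elements, bytes_per_element, num_ops, fused):
    """Bytes moved to and from memory by a chain of unary elementwise ops.

    Unfused: each op reads its input and writes its output (2 transfers per op).
    Fused: one read of the chain input and one write of the chain output.
    """
    transfers = 2 if fused else 2 * num_ops
    return transfers * num_elements * bytes_per_element


# Hypothetical workload: 4 elementwise ops over 1,000,000 float32 activations.
unfused = chain_traffic_bytes(1_000_000, 4, num_ops=4, fused=False)
fused = chain_traffic_bytes(1_000_000, 4, num_ops=4, fused=True)

print(unfused)          # 32000000
print(fused)            # 8000000
print(unfused / fused)  # 4.0
```

Under these assumptions the traffic ratio equals the number of ops in the chain, which is why longer chains are more attractive fusion targets on bandwidth-bound hardware.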

Why the compiler didn't fuse

The pattern sat across a boundary the default fusion rules treated as opaque. The chain was technically fusible; it simply fell outside what the heuristic would attempt.
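To illustrate how a conservative rule can split a fusible chain, here is a toy greedy fusion pass. It is not the real compiler's heuristic, and the note does not say what the boundary was. The `Op` class, the `opaque` flag, and the op names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    elementwise: bool
    opaque: bool = False  # hypothetical flag: the heuristic never fuses this op


def greedy_fuse(ops):
    """Group consecutive fusible elementwise ops; any other op stands alone."""
    groups, current = [], []
    for op in ops:
        if op.elementwise and not op.opaque:
            current.append(op.name)
        else:
            if current:  # close the group that was being built
                groups.append(current)
                current = []
            groups.append([op.name])
    if current:
        groups.append(current)
    return groups


# Hypothetical graph: the cast is elementwise, but the rules treat it as opaque.
graph = [
    Op("conv", elementwise=False),
    Op("scale", elementwise=True),
    Op("cast", elementwise=True, opaque=True),
    Op("clamp", elementwise=True),
    Op("relu", elementwise=True),
    Op("matmul", elementwise=False),
]

print(greedy_fuse(graph))
# [['conv'], ['scale'], ['cast'], ['clamp', 'relu'], ['matmul']]
```

In this toy model a single opaque op in the middle turns one four-op group into three separate kernels, even though every op in the chain is elementwise.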

The fix

Rather than fork the backend, we wrote a single fused kernel for the elementwise chain and registered it as a custom op the runtime could schedule in place of the original subgraph.

Activations stayed resident instead of spilling between every op.
The heavier kernels on either side were untouched — low blast radius.
The change was isolated to one op definition, easy to test and revert.
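The shape of that change can be sketched in plain Python. This is a model of the idea only: the actual kernel, the ops in the chain, and the runtime's registration API are not given in this note, so the scale/bias/ReLU chain and the `CUSTOM_OPS` registry below are hypothetical stand-ins.

```python
# Hypothetical elementwise chain: scale, add a bias, then ReLU.
def scale(xs, s):
    return [x * s for x in xs]


def add_bias(xs, b):
    return [x + b for x in xs]


def relu(xs):
    return [max(x, 0.0) for x in xs]


def unfused_chain(xs, s, b):
    # Each step materializes a full intermediate buffer.
    return relu(add_bias(scale(xs, s), b))


def fused_chain(xs, s, b):
    # One pass over the input; intermediates stay in a local variable.
    # Same operations in the same order as the unfused chain.
    out = []
    for x in xs:
        t = x * s
        t = t + b
        out.append(max(t, 0.0))
    return out


# Toy registry standing in for the runtime's custom-op mechanism.
CUSTOM_OPS = {}
PATTERN = ("scale", "add_bias", "relu")


def run_subgraph(pattern, xs, s, b):
    # Use a registered fused kernel if one exists, else the original chain.
    kernel = CUSTOM_OPS.get(pattern, unfused_chain)
    return kernel(xs, s, b)


xs = [-2.0, -0.5, 0.0, 1.5]
before = run_subgraph(PATTERN, xs, 2.0, 1.0)
CUSTOM_OPS[PATTERN] = fused_chain  # the one-line, easily reverted change
after = run_subgraph(PATTERN, xs, 2.0, 1.0)
print(before == after)  # True
```

Reverting amounts to removing the registry entry, which is what keeps the change isolated to one op definition.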

Result

The fused kernel removed the redundant memory traffic, and end-to-end latency dropped accordingly. Outputs were bitwise-equivalent to the original graph's on the validation set.
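Bitwise equivalence is a stricter test than numeric equality, and it is only achievable when the fused kernel performs the same operations in the same order and precision as the original chain. A minimal sketch of such a check follows. It assumes outputs are available as Python floats (binary64), and the helper names are hypothetical; a real harness would more likely compare the raw output buffers directly.

```python
import struct


def float64_bits(x):
    """Return the IEEE 754 binary64 bit pattern of x as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bitwise_equal(a, b):
    """True if both sequences have the same length and identical bit patterns."""
    return len(a) == len(b) and all(
        float64_bits(x) == float64_bits(y) for x, y in zip(a, b)
    )


print(bitwise_equal([1.0, 2.5], [1.0, 2.5]))  # True
print(0.0 == -0.0)                            # True: numerically equal
print(bitwise_equal([0.0], [-0.0]))           # False: the sign bit differs
```

Comparing bit patterns also sidesteps the fact that NaN never compares equal to itself, so identical NaN outputs do not register as a mismatch.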

Takeaway

Read the profile before you trust the compiler. When the bottleneck is memory traffic from un-fused elementwise chains, a single hand-written fused kernel is often the highest-leverage, lowest-risk change available.