Rob Allen
Jonas Cronholm Lundin

Incident management at Swish - a logistics problem

Jonas Cronholm Lundin, Head of Platform at Swish, discussed the evolution of incident management at the company at the Netnod Meeting 2026. His presentation detailed the company's transition from an outsourced, reactive IT model to a highly automated, data-driven approach based on Site Reliability Engineering and GitOps.

This post is part of a series summarising presentations from the Netnod Meeting 2026. You can find links to the full presentation and slides at the end of this article.

Identifying the visibility gap

Launched in 2012 as a collaboration between major Swedish banks, Swish has become a cornerstone of the Swedish financial ecosystem. However, by 2021, the company faced significant operational hurdles.

At the time, incident management and infrastructure were entirely outsourced. The model relied on external service desks located elsewhere in Europe, which created a significant visibility gap: the teams answering incident reports often lacked the contextual knowledge of the Swish service required to resolve issues effectively.

This lack of internal control created operational risk: major media outlets often reported outages before the company itself had identified them. Furthermore, without clear root-cause data, engineering teams struggled to identify effective remediation, which led to high rollback rates for software releases. During this period, 70-80% of Swish's monthly software releases had to be rolled back because errors were only discovered days later.

Establishing new operational principles

Recognising that theoretical risk matrices were failing to prevent real-world outages, Swish initiated a fundamental shift in 2022. They adopted a "first principles" approach, centring their operations on modern engineering standards:

  • GitOps: handling internal change management through code.
  • SRE: adopting Site Reliability Engineering (SRE) as the operational standard.
  • SLSA: integrating the Supply-chain Levels for Software Artifacts (SLSA) framework to ensure build provenance and security.

Crucially, Swish inverted their approach to stability. Rather than relying solely on pre-release testing and approval meetings, they prioritised the foundation: comprehensive monitoring and observability in production.

Data-driven visibility and response automation

To address their "blind spots," Swish implemented end-to-end tracing across their entire infrastructure. This required close collaboration with their partners to deploy custom monitoring agents.

The result was transformative. By tracking payment conversions in real time (measuring initialised payments against successful settlements), Swish gained an actionable signal of how the service was performing. This enabled the team to move away from guesswork and take immediate, data-backed action when anomalies occurred.
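As a rough illustration of the idea, the sketch below keeps a rolling conversion rate over the most recent payment outcomes and flags when it drops below a threshold. It is not Swish's implementation: the class name `ConversionWindow`, the window size and the threshold are all hypothetical values chosen for the example.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass
class ConversionWindow:
    """Rolling conversion rate: settled payments / initialised payments.

    All names and default values here are illustrative assumptions.
    """

    size: int = 1000         # hypothetical: number of recent payments to keep
    threshold: float = 0.95  # hypothetical: rate below this counts as an anomaly

    def __post_init__(self) -> None:
        # A bounded deque discards the oldest outcome once the window is full.
        self._outcomes: deque[bool] = deque(maxlen=self.size)

    def record(self, settled: bool) -> None:
        """Record one initialised payment and whether it settled."""
        self._outcomes.append(settled)

    def rate(self) -> float | None:
        """Return the conversion rate, or None if nothing has been recorded."""
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def is_anomalous(self) -> bool:
        """True when there is data and the rate is below the threshold."""
        rate = self.rate()
        return rate is not None and rate < self.threshold
```

In use, each initialised payment would call `record()` once its outcome is known, and an alerting job would poll `is_anomalous()`. A production system would more likely use time-based windows and a baseline that accounts for normal daily variation.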

Using this data, Swish restructured their incident lifecycle. They replaced text-heavy, traditional documentation with actionable, graphical runbooks structured around the "OODA loop" (Observe, Orient, Decide, Act). This framework provides incident commanders with a clear, structured path during a crisis.

Swish also automated incident communication. The system now automatically detects when a participating bank is experiencing issues and pushes graphical notifications directly to user apps.
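The presentation summary does not describe how this detection works, so the following is only a sketch of one plausible approach: group recent payment outcomes by bank, flag any bank whose failure rate exceeds a limit, and build a notice for it. The function names, the `(bank_id, settled)` event shape, the thresholds and the notice fields are all assumptions made for the example.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


def degraded_banks(
    events: Iterable[tuple[str, bool]],
    min_events: int = 50,           # hypothetical: ignore banks with too little traffic
    max_failure_rate: float = 0.2,  # hypothetical: failure rate that triggers a notice
) -> list[str]:
    """Return the banks whose recent failure rate exceeds the limit.

    Each event is a hypothetical (bank_id, settled) pair.
    """
    totals: Counter[str] = Counter()
    failures: Counter[str] = Counter()
    for bank, settled in events:
        totals[bank] += 1
        if not settled:
            failures[bank] += 1
    return sorted(
        bank
        for bank, count in totals.items()
        if count >= min_events and failures[bank] / count > max_failure_rate
    )


def build_notice(bank: str) -> dict[str, str]:
    """Build a hypothetical notification payload for a degraded bank."""
    return {
        "bank": bank,
        "status": "degraded",
        "message": f"Payments via {bank} may be delayed.",
    }
```

The minimum event count is there so that a handful of failed payments at a low-traffic bank does not trigger a notice on its own.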

Finally, Swish overhauled their post-incident learning process. They replaced traditional ITIL problem tickets with transparent, weekly post-mortem reviews involving stakeholders from across the company. By sharing these operational insights with participating banks, Swish is strengthening the collective resilience of Sweden's financial infrastructure.

You can watch Jonas Cronholm Lundin’s presentation from the Netnod Meeting 2026 here and view the slides here.

 
