Troubleshooting and rescue is the engagement shape we take on when the system is already broken: performance that has quietly degraded, intermittent bugs that resist reproduction, systems with load-bearing parts that nobody on the current team wrote, and the long shadow that team turnover casts over a codebase. It is the rescue side of our broader practice.
Most rescues start with a small triage block (typically eight hours, sometimes preceded by a same-day half-hour call if the system is actively on fire) to understand the problem before scope is committed. By the end of triage you have one of three things: a fix, a clear remediation plan with an effort estimate, or a recommendation that this isn't the right shape of engagement for us. All three are useful outcomes.
The principle. Most hard bugs become tractable once you stop trusting the team's mental model and look at what the system is actually doing: logs, metrics, traces, the code itself. The skill is knowing which evidence to gather and what it really tells you.
For the full picture (including the four situations the practice is built for, what an audit covers in detail, and indicative pricing) see Codebase Audit, Rescue & Technical Due Diligence. Production rescue is the rescue scenario on that page.
Frequently Asked Questions
Can you do this in an emergency?
Yes. Urgent rescue engagements start with a short triage, typically a half-hour call followed by an eight-hour triage block, to understand what is actually breaking before scope is committed. Out-of-hours work is charged at the published rate multipliers, which matters when production decides to fail at midnight on a Saturday.
How do you start when the team can't reproduce the bug?
By reading what the system actually does in production rather than what people remember it doing: logs, metrics, traces, the code itself. Most hard bugs become tractable once you stop trusting the team's mental model and look at the evidence directly.
Do you only fix the immediate problem, or also the underlying issue?
Both. We put out the immediate fire so the system is stable, then document the root cause and what would prevent the same problem from recurring. Whether your team takes on the longer-term remediation or we do is decided within the engagement, not assumed by default.