It's Monday, June 27, just before 13:00 UTC, and Basecamp 3 is humming along at peak performance. In the span of the next 30 seconds, Basecamp 3 went from fully operational to severely struggling and then down entirely. It took us nearly three hours to fully restore service. Your account data remained secure, intact, and thoroughly backed-up throughout.

All told, Basecamp 3 was down for 1.7 hours. HEY, Basecamp 2, Basecamp Classic, and our other products were unaffected. Here's what happened, why, and what we're doing about it.

For many of you, this outage ate up an entire morning or afternoon on a Monday, right at the top of the week. It put some of you in the position of excusing Basecamp's reliability to your clients or colleagues as you rescheduled meetings at the last minute. We're deeply sorry for interrupting and delaying your work.

Furthermore, we're late in sharing an update. We've been sleuthing for the underlying causes of the outage so we can definitively guarantee that it won't occur again, and that's proven elusive. We're sorry to leave you hanging, wondering. We're leaning on your goodwill, and we will restore it.

What you saw

On Monday, June 27 at 13:00 UTC, Basecamp 3 started getting extremely slow, then started showing error pages. Some things did work but took forever, yet Basecamp 2 and Basecamp Classic were behaving normally. Some minutes later, things seemed to be up & running again, only to slow down and start showing errors minutes afterward. Then, nothing worked at all! Finally, around 15:45 UTC, Basecamp 3 was back in business and stayed that way.

Behind the scenes, within the span of 30 seconds, between 13:00:00 and 13:00:30 UTC, Basecamp 3's primary database went from breathing freely to catastrophically congested. We responded immediately with a series of escalating countermeasures, but plans A, B, and C each fell through, triggering entirely new errors. Eventually, plan D worked and we restored Basecamp 3 to full service at 15:49 UTC. By that point, we had clocked 103 minutes (1.7 hours) of cumulative downtime.

Why it happened

Diagnosing this database overload has frustrated our efforts at root-cause analysis. We've been deep in investigation, research, and remediation, which delayed this report well past its due date with the expectation that we'd find a definitive answer. Trouble is, there's no single, clear culprit to pin down.

We're dealing with a "tipping point," where stepping just beyond a system's limits sends it from stable to unstable in a flash. That's the nature of a tipping point: it creeps up on you. Why didn't this happen last week, or the week before, or next week? What was different about this week? No particular difference or trigger event is needed, as much as we'd like to find one. Basecamp 3 is growing steadily and rapidly, inching the primary database ever closer to its limits, and Monday, June 27 was simply the first day that it stepped just beyond the tipping point.

For our primary database, the cascade began with internal lock contention. We run at very high concurrency, with thousands of database connections from our Rails application servers and hundreds of execution threads vying for processing time on a very beefy many-CPU system. Databases are built to handle precisely this sort of workload. To coordinate all this activity happening in parallel, the database uses a variety of "locking" mechanisms to govern concurrent data access. Locks ensure data consistency and durability by ensuring that readers and writers proceed in an orderly manner.

Locking purposefully limits concurrent operations, resulting in bottlenecks where one slow operation that's holding a lock blocks all the others waiting on it. It's like a crowd moving single-file through an exit when someone stops to tie their shoe on the way through the door. This cascades catastrophically when those waiting on the lock are consuming limited database resources, or are even holding locks of their own that others are waiting on, causing an exponentially worsening pile-up.
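To make the pile-up concrete, here's a minimal sketch of how one slow transaction holding a row lock stalls everyone queued behind it. This is not Basecamp's actual code: the `Account` model, its `plan` attribute, and the 30-second stall are hypothetical, and it assumes a database that takes row-level locks for `SELECT ... FOR UPDATE`, as InnoDB does.

```ruby
# Request A: takes an exclusive row lock, then dawdles while holding it.
Account.transaction do
  account = Account.lock.find(1)  # SELECT ... FOR UPDATE: exclusive row lock
  account.update!(plan: "jumbo")  # the row stays locked until COMMIT
  sleep 30                        # stand-in for anything slow while the lock is held
end                               # lock is released only here, at COMMIT

# Requests B, C, D, ... arriving in the meantime block on the same row.
# While blocked, each one still holds a database connection and a server
# thread, so the queue behind the lock eats exactly the limited resources
# the rest of the site needs.
Account.transaction do
  Account.lock.find(1)            # waits for A; may eventually hit the lock wait timeout
end
```

The key point of the sketch is that the lock holder isn't broken, just slow, and it's the crowd queued behind it, each waiter tying up its own connection and possibly its own locks, that takes the whole system past the tipping point.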