Archive Sites down due to Azure issues in East US Region
Incident Report for Pugpig
Postmortem

We have received the following Root Cause Analysis from Microsoft:

The Microsoft Azure Team has investigated the issue reported regarding the HTTP 500 level errors that your app experienced.

On 04/06/2023, App Service rolled out a configuration change to our scale units in East US and North Central US. 
The config change was part of our platform upgrade and was performed to enhance reliability and security on our scale units.
Unfortunately, on a subset of our scale units, this change impacted the ability of the front ends to access the storage subsystem. As a result, your app might have experienced read/write access failures to files.

The issue was automatically detected and the upgrade was immediately paused. To mitigate the issue, engineers reverted the config change on all the impacted scale units.
Additionally, we have set up verification to ensure that all the impacted apps are mitigated.

We are continuously taking steps to improve the Azure Web App service and our processes to ensure such incidents do not occur in the future, and in this case that includes (but is not limited to):

  • Enhanced monitoring and notification of instability in the storage subsystem
  • Enhanced testing to ensure any potential issues with config change rollout are identified early

We apologize for any inconvenience.

Posted Apr 07, 2023 - 19:57 UTC

Resolved
Microsoft has confirmed that this is resolved. Apologies for the service interruption.
Posted Apr 06, 2023 - 21:12 UTC
Monitoring
The archive sites all appear to have returned to normal, so it seems Microsoft have fixed things. However, we are still waiting to hear more from their support team, so we will continue to monitor the sites.
Posted Apr 06, 2023 - 20:30 UTC
Update
We have received the following update from Microsoft Support:

STATUS:
In-Progress 4/6/2023, 1:30:06 PM UTC

SUMMARY OF IMPACT:
Impact Statement: Starting at 11:50 UTC on 06 Apr 2023, you have been identified as a customer using App Service in East US or North Central US who may receive intermittent HTTP 500-level response codes, experience timeouts or high latency when accessing App Service (Web, Mobile, and API Apps), App Service (Linux), or Function deployments hosted in this region.

Current Status: Our investigation suggests that a recent deployment task on a backend instance that App Service utilizes became unhealthy, causing users to experience the issues mentioned above. The root cause has been identified, mitigation strategies are currently being evaluated and the fix is currently being prepared. Next update will be provided in 60 minutes, or as events warrant.

NEXT STEPS:
We are urgently investigating the root cause and coming up with a plan to resolve as soon as possible. We'll send an update once we know more.
Posted Apr 06, 2023 - 18:36 UTC
Identified
We are expecting an update from Microsoft Support on this in the next hour.
Posted Apr 06, 2023 - 18:24 UTC
Update
Microsoft are still investigating. There seem to be a few different issues happening. At present, about half of the Pugpig Archive sites are fine (the ones in a different region). Of the broken ones, half are displaying a "The page cannot be displayed because an internal server error has occurred." fatal error, while the others are consistently displaying the Page Not Found page.

Other users are also complaining about this outage on Twitter, although the Microsoft Azure status page (https://azure.status.microsoft/) does not have any info as of now.
Posted Apr 06, 2023 - 16:35 UTC
Investigating
Many of our Archive sites are currently down or returning errors due to a Microsoft Azure East US issue. We currently have an open ticket with Microsoft. The latest update:

"As we had discussed on this call, I looked into the service described in this support ticket for any potential issues from the application. At this time, the pattern I am noticing is the same as others in this region. Our engineers are currently investigating the cause of this issue and are working to mitigate this as soon as possible."
Posted Apr 06, 2023 - 15:52 UTC
This incident affected: Pugpig Archive (PDF Ingestion, Pugpig Archive Editor, Archive Web Sites).