NML

Incident reports

Posted on 09 March 2020
Charl Marais

Overview

Production issues occur and will continue to occur. As a service provider, it is our responsibility to help our clients understand the scope, causes, impact, and resolution of production issues. We do this with incident reports, which serve as formal communication to clients.

Incident reports should be completed by the team, reviewed by senior management (Paul, Dave or Charl), and sent to the client within three (3) working days of the incident occurring. A copy should be stored in the appropriate sub-folder under the Incident Reports folder, which can be found under nml.co.za (team) -> NML Incident Reports👉🐞 -> Files (tab).
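The three-working-day deadline can be worked out with a small helper. This is only a minimal sketch: the function name `report_due_date` is hypothetical, and it assumes that working days are Monday to Friday, that the day of the incident itself is not counted, and that public holidays are ignored. None of these rules are stated above, so adjust them to match how the team actually counts.

```python
from datetime import date, timedelta


def report_due_date(incident_date: date, working_days: int = 3) -> date:
    """Return the date falling `working_days` working days after the incident.

    Assumptions (not taken from the policy text): working days are
    Monday to Friday, the incident day is not counted, and public
    holidays are ignored.
    """
    due = incident_date
    remaining = working_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return due


# An incident on Monday 9 March 2020 would be due on Thursday 12 March 2020.
print(report_due_date(date(2020, 3, 9)))
# An incident on Friday 13 March 2020 skips the weekend.
print(report_due_date(date(2020, 3, 13)))
```

Under these assumptions, an incident on a Friday gives a due date of the following Wednesday.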

Criteria for Incident Reports

An incident report is required when:

Incident Report Form

Online form

You can also find the above form in Teams under nml.co.za -> NML Incident Reports👉🐞 -> Incident Report Form (tab).

Legacy forms

Template form on Teams

Template form on Sharepoint

A template can be found at the links above. The document sections are:

Section: <Client_Name>
Description: The name of the client for whom the incident report is being created.

Section: Overview
Description: Summary details about the incident, in table format.

Section: Detailed Description
Description: A description of the issue and a timeline of the key events.
Example: Alerts for a RabbitMQ Connection Recovery Error exception and an EventStore Timeout exception were emailed to the Graphite team and sent to the prod-alerts Slack channel at 13:19 SA time. The issue presented itself on OData calls. Saffia, who was on support, acknowledged the issue and alerted the client on the productionsupport and prod-alerts Slack channels at 13:29 SA time, also alerted the Graphite PM, Eugenie, and started investigating the issue. The issue was resolved within 38 minutes.

Section: Impact
Description: How the client and/or system was impacted.
Example: Clients were unable to submit instructions for the duration of the downtime, and instructions that were in the Receive endpoint at that time were not processed.

Section: Root Cause Analysis
Description: What investigative tasks were done to determine the cause of the issue?
Example: The issue presented itself as a RabbitMQ Recovery exception on the endpoints and an EventStore Timeout exception on the Web API. To understand what was happening, we looked at the production Application Insights logs and remoted into the EventStore virtual machine to look at the EventStore logs and check the status of EventStore. We determined that EventStore was not running. We also checked the connection between the Web API and EventStore to ensure that there was no VPN Gateway issue.

Section: Corrective Action
Description: What actions were taken to resolve the issue?
Example: To fix the issue, we had to restart the EventStore service and all the endpoints. This took a few minutes. On completion, the client was immediately informed and was satisfied that OData calls were returning successfully. The client was also informed of the instructions that were rejected, which were logged on the production service desk with their message IDs and correlation IDs, and was asked to resubmit them.

Section: Preventative Measures
Description: What measures will NML put in place to help prevent the issue from recurring, and to identify and handle it better if it does?
Example: Investigate and discuss with the client a timeline to get the EventStore architectural change to Production as soon as possible. EventStore, which has been problematic, has been swapped out for CosmosDb, which has been more reliable in our QA environment. This change needs to be promoted to Production to improve stability.