Production issues do and will continue to occur. As a service provider, it is our responsibility to help our clients understand the scope, causes, impact, and resolution of production issues. We do this with incident reports, which serve as formal communication to clients.
Incident reports should be completed by the team, reviewed by senior management (Paul, Dave or Charl) and sent to the client within three (3) working days of the incident occurring. A copy should be stored in the appropriate sub-folder under the Incident Reports folder, which can be found under
nml.co.za (team) ->
General (channel) ->
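The three-working-day deadline for sending the report can be worked out with a short calculation. The sketch below is illustrative only: the function name `report_due_date` is hypothetical, and it assumes that counting starts on the day after the incident, that working days are Monday to Friday, and that public holidays are not taken into account.

```python
from datetime import date, timedelta


def report_due_date(incident_date: date, working_days: int = 3) -> date:
    """Return the date that falls `working_days` working days after the incident.

    Assumptions (not stated in the policy): counting starts the day after the
    incident, weekends are Saturday and Sunday, and public holidays are ignored.
    """
    due = incident_date
    remaining = working_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return due


if __name__ == "__main__":
    # An incident on Thursday 7 March 2024 skips the weekend while counting.
    print(report_due_date(date(2024, 3, 7)))
```

If public holidays should also be skipped, the weekday check would need to be extended with a holiday calendar agreed with the client.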
Criteria for Incident Reports
An incident report is required when:
- The client is unable to use any part of the production system that is reasonably expected to be available.
- A system process or a back-end task has failed to operate as expected and has had, or will have, a direct or indirect impact on data or related systems.
Incident Report Form
A template can be found at the links above. The document sections are:
| Section | Description | Example |
|---|---|---|
| <Client_Name> | The name of the client for whom the incident report is being created. | |
| Overview | Summary details about the incident, in table format. | |
| Detailed Description | A description of the issue and a timeline of the key events. | Alerts for a RabbitMQ Connection Recovery Error exception and an EventStore Timeout exception were emailed to the Graphite team and sent through to the prod-alerts Slack channel at 13:19 SA time. The issue was presenting itself on OData calls. Saffia, who was on support, acknowledged the issue and, at 13:29 SA time, alerted the client on the productionsupport and prod-alerts Slack channels, as well as the Graphite PM, Eugenie, and started investigating the issue. The issue was resolved within 38 minutes. |
| Impact | How the client and/or system was impacted. | Clients were unable to submit instructions for the duration of the downtime, and instructions that were in the Receive endpoint at that time were not processed. |
| Root Cause Analysis | What investigative tasks were carried out to determine the cause of the issue? | The issue presented itself as a RabbitMQ Recovery exception on the endpoints and an EventStore Timeout exception on the web API. To understand what was happening, we looked at the production Application Insights logs and remoted into the EventStore virtual machine to look at the EventStore logs and check the status of EventStore. We determined that EventStore was not running. We also checked the connection between the web API and EventStore to ensure that there was no VPN Gateway issue. |
| Corrective Action | What actions were taken to resolve the issue? | To fix the issue, we had to restart the EventStore service and all the endpoints. This took a few minutes. On completion, the client was immediately informed and was satisfied that OData calls were returning successfully. The client was also informed of the instructions that were rejected, which were logged on the production service desk with their message IDs and correlation IDs, and was asked to resubmit them. |
| Preventative Measures | What measures will NML put in place to assist in the prevention, identification and better handling of the issue should it recur? | Investigate and discuss with the client a timeline to get the EventStore architectural change to Production as soon as possible. EventStore, which has been problematic, has been swapped out for Cosmos DB, which has been more reliable in our QA environment. This change needs to be promoted to Production to improve stability. |
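For teams that want to check a draft before it goes to senior management for review, the sections listed above can be modelled as a simple record. This is a minimal sketch rather than an NML tool: the `IncidentReport` class, its field names, and the `missing_sections` helper are all hypothetical and only mirror the section names in the template.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentReport:
    """Hypothetical record with one text field per section of the template."""
    client_name: str
    overview: str
    detailed_description: str
    impact: str
    root_cause_analysis: str
    corrective_action: str
    preventative_measures: str


def missing_sections(report: IncidentReport) -> list[str]:
    """Return the names of sections that are empty or contain only whitespace."""
    return [f.name for f in fields(report) if not getattr(report, f.name).strip()]


if __name__ == "__main__":
    draft = IncidentReport(
        client_name="Example Client",  # placeholder value, not a real client
        overview="EventStore outage affecting OData calls.",
        detailed_description="Alerts were raised at 13:19 SA time.",
        impact="",
        root_cause_analysis="EventStore was not running.",
        corrective_action="Restarted the EventStore service and all endpoints.",
        preventative_measures="   ",
    )
    # Lists the sections that still need to be written before review.
    print(missing_sections(draft))
```

A check like this only confirms that every section has some content; it does not replace the senior management review.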