Degraded performance on the BLiP platform

Incident Report for Blip

Postmortem

Hi blipper!

On Wednesday, September 8, 2021, Blip experienced an outage that affected the operation of many Smart Contacts.

To be transparent with you, Blip user, we are writing to tell you what happened.

What happened

We identified a connection failure with our cache service, which is used to store contact information. This failure interrupted the exchange of messages with our smart contacts.

How this issue impacted you

Because of this failure, the Blip CRM, our customer base management functionality, had problems exchanging messages.

On behalf of Take Blip, I apologize for any problems caused to you, your company, and your customers.

What we did to solve it

As soon as we identified the problem at 8:00 PM (UTC-3), our team came together and acted quickly to begin mitigation. The fix was applied immediately, and the service returned to normal at 10:36 PM (UTC-3).

Where we are now

The Blip CRM is working again, and our technical team is following up with the cloud provider to identify the root cause of the failure. In addition, we are taking internal actions to prevent similar events from happening again.

You can check this history and the status of all other Blip features on our Status Page.

We also want to thank you for your patience and remind you that we are always here to help with anything you need. Just open a request with our Support team or create a new topic on the Blip Forum, the space dedicated to our entire user community.

Sincerely,

Posted Sep 10, 2021 - 18:33 GMT-03:00

Resolved

Status Update:

Fault identified:

Our monitoring identified that our cache service lost its connection. Because this cache stores the information of the contacts that talk to the bots, message traffic was completely unavailable during the period.

Temporary fix: As a workaround, we restarted the counter generation service that was causing the connection leak on our cache server. After this workaround, the smart contacts began responding again.

Start date/time: 11:25 PM
End date/time: 11:52 PM

Actions in progress: Deploying a fix to production for the failing service.

Root cause: connection leak with the caching server.
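
The report does not describe how the connections were leaked, so the following is only a minimal sketch of one common way to prevent this class of problem: borrowing connections from a bounded pool through a context manager, so that every connection is returned even when the caller raises an error. All names here (FakeCacheConnection, BoundedPool, next_counter) are hypothetical and are not taken from the Blip codebase or from any real cache client.

```python
import queue
from contextlib import contextmanager


class FakeCacheConnection:
    """Hypothetical stand-in for a real cache client connection."""

    def __init__(self, conn_id):
        self.conn_id = conn_id

    def incr(self, store, key):
        # Simulates an atomic counter increment on the cache server.
        store[key] = store.get(key, 0) + 1
        return store[key]


class BoundedPool:
    """Fixed-size pool. If callers never return connections, it runs dry."""

    def __init__(self, size):
        self.size = size
        self._idle = queue.Queue(maxsize=size)
        for i in range(size):
            self._idle.put(FakeCacheConnection(i))

    def available(self):
        return self._idle.qsize()

    @contextmanager
    def connection(self, timeout=0.1):
        # Raises queue.Empty if no connection becomes free within `timeout`.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller failed.
            self._idle.put(conn)


def next_counter(pool, store, key):
    """Hypothetical counter generation call that cannot leak a connection."""
    with pool.connection() as conn:
        return conn.incr(store, key)


if __name__ == "__main__":
    pool = BoundedPool(size=2)
    store = {}
    for _ in range(10):
        next_counter(pool, store, "messages")
    print(store["messages"], pool.available())
```

With this pattern, the number of idle connections after any amount of work equals the pool size. A code path that takes a connection without the context manager and never returns it would eventually make every later call time out, which is the kind of symptom a connection leak produces.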

Posted Sep 09, 2021 - 00:14 GMT-03:00

Identified

The issue has been identified and a fix is being implemented.

Posted Sep 08, 2021 - 23:42 GMT-03:00

Update

Identified:

The monitoring team identified that the cache service application lost its connection to all nodes. Because this cache stores the information of the contacts that talk to the bots, message traffic was completely unavailable.

Impact:

Smart contacts stopped responding.

Solution:

Restarted the storage service and updated the default cache server validation time.

Start date/time: Sep 08, 2021 08:20 PM
End date/time: Sep 08, 2021 10:36 PM

Posted Sep 08, 2021 - 23:16 GMT-03:00

Monitoring

Our team identified a fault in our platform's caching service.
We intervened on the service, and the bots are responding again.
More details about the failure will be provided in our postmortem.

Posted Sep 08, 2021 - 22:40 GMT-03:00

Update

Our platform team continues to investigate the scenario.

Posted Sep 08, 2021 - 21:33 GMT-03:00

Update

Impact:

Bots may not respond to user interactions. All channels are affected.

Posted Sep 08, 2021 - 20:51 GMT-03:00

Investigating

We are experiencing degraded performance on the BLiP platform. Our technical team is already working on the case.

Posted Sep 08, 2021 - 20:37 GMT-03:00

This incident affected: Hosting Business (Bot Builder, Bot Router), Blip Platform (CRM, Core, Analytics, Artificial Intelligence, Portal, Cloud Infrastructure), Desk, and Hosting Enterprise (Bot Builder, Bot Router).