Timeout in Builder actions
Incident Report for Blip
Postmortem

Start time: September 30, 2024 10:06 AM

End time: October 15, 2024 5:38 PM

Incident summary:

Starting on September 30th, we identified a problem on the Blip platform that affected the execution of actions and the processing of commands in Builder blocks. This may have affected bot message exchanges and the publishing of new flows, intermittently and at specific times of the day. Since the actions taken by our team, the problem has not recurred.

Impact analysis:

Bots stopped responding or performing actions in the Builder, so users were unable to communicate effectively with them.

What caused the instabilities?

An internal Microsoft update changed the security protocol used by the database. Under a high volume of operations, an enabled feature that was not aligned with the new communication standard caused a conflict, and queries failed to execute.

Resolution actions:

Palliative actions: We increased the resilience of the environment and the application and optimized the connection between them.

The definitive fix was carried out in phases by migrating the application structure to a new database, which resolved the identified problems and restored the environment. We have not recorded any failures since the last one, reported on October 15th at 5:38 PM.

Posted Oct 29, 2024 - 12:30 GMT-03:00

Resolved
After the actions taken by our engineering team, we have not observed the scenario again. More details about the problem and its root cause will be posted in the postmortem as soon as possible.
Posted Oct 24, 2024 - 11:00 GMT-03:00
Update
Since September 30th, we have identified an issue occurring on the Blip platform, which has impacted the execution of actions and the processing of commands in the builder blocks. As a result, this has led to possible impacts on bot message exchanges and the publishing of new flows, happening intermittently at specific times throughout the day.
Our team is actively monitoring the situation and taking immediate actions to minimize the impact. Among the ongoing actions, we are making targeted interventions in services and adjusting application configurations to ensure the system operates as expected. Simultaneously, dedicated technical teams are implementing automations and various solutions aimed at mitigating the situation and preventing future occurrences.
To definitively resolve the problem, we are carrying out a project to adjust our database structure. To avoid greater impacts on your operations, this adjustment will take place in phases between October 12th and 21st, 2024, and will be closely monitored by our technical team.
Until the project is completed, you may experience this issue in different ways. If you notice anything unusual, we kindly ask that you wait for a period of 3 minutes and try again. If the behavior persists, our support team will be available to assist you.
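For integrations that call the platform programmatically, the guidance above can be sketched as a simple wait-and-retry helper. This is a minimal illustration, not part of the Blip platform: the function name `retry_after_wait`, its parameters, and the use of Python's built-in `TimeoutError` to represent a Builder timeout are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_after_wait(
    action: Callable[[], T],
    wait_seconds: float = 180.0,  # the 3-minute wait suggested above
    max_attempts: int = 2,        # assumption: one initial try plus one retry
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`; on TimeoutError, wait and try again.

    `action` is a hypothetical callable wrapping whatever request timed out.
    If the last attempt still times out, the error is re-raised so the
    caller can escalate (for example, by contacting support).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # behavior persists: hand over to support
            sleep(wait_seconds)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the helper can be exercised without actually waiting three minutes.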
Our commitment is to always deliver the best for our customers. Therefore, in addition to making the necessary adjustments to mitigate the issue, we will also use the project phasing to enhance the robustness of our database, ensuring better performance in your operation.
We sincerely apologize for the inconvenience you are experiencing, as this is not the kind of experience we expect our customers to have.
We appreciate your understanding and reaffirm our commitment to resolving this situation.
We will continue to keep you informed about the progress and any relevant updates.
Posted Oct 09, 2024 - 16:15 GMT-03:00
Update
We continue to face instability on our platform, with temporary interruptions of 1 to 3 minutes in one of our applications. This can cause failures in executing actions and processing commands, affecting the exchange of messages in bots and the publication of new flows.

What we are doing:
1 - We have completed the update of one of our components to improve the app's performance and provide a better user experience. [Completed]
2 - Regarding the migration, the information has been published on the scheduled maintenance status page and updates can be followed there. [In progress]

We will keep you updated on progress and thank you for your patience as we work to resolve this issue.
Posted Oct 09, 2024 - 00:15 GMT-03:00
Update
We continue to face instability on our platform, with temporary interruptions of 1 to 3 minutes in one of our applications. This can cause failures in executing actions and processing commands, affecting the exchange of messages in bots and the publication of new flows.

What we are doing:
We will update one of our components to improve the application's performance. We are also working on migrating one of our tables, which stores user context data, to another database to provide greater stability and performance. We will keep you updated on progress and thank you for your patience as we work to resolve this issue.
Posted Oct 08, 2024 - 15:56 GMT-03:00
Update
We continue to monitor the environment and identify the root cause.
Posted Oct 04, 2024 - 16:51 GMT-03:00
Update
We have identified an issue with the functioning of our platform that may affect your overall experience with Blip. Therefore, in order to maintain transparency with you, we are writing to tell you what is happening and what we are doing about it.

What's happening and how does it affect you?
Our monitoring systems reported timeouts in the Builder applications, causing failures in executing actions and processing commands in the blocks. This means that the exchange of messages in bots and the publication of new flows may be compromised.

This is an intermittent scenario: each occurrence we have observed lasted between 1 and 3 minutes. The timeout sometimes happens and sometimes does not, so this is instability rather than a complete shutdown of the environments.

What are we doing?
The Infrastructure, Engineering, Monitoring and Support teams are working continuously in the war room and intervening in the environment whenever a timeout is noticed, so that the impact on applications is as small as possible during peak/failure times. These have been the fastest actions available to normalize the environment, while the team works in parallel to find the root cause of these problems and deliver a definitive solution.

During the analysis, we identified points for improvement in the applications and carried out optimizations in the application connections and data structure.

We are now monitoring the environment to ensure that it is operating as expected.

Finally, the entire Blip team reiterates our commitment to you and reinforces that we are working tirelessly to restore the full normality of our services and applications as quickly as possible.

Updated status:
We are continuing to monitor the environment and identify the root cause.
Posted Oct 04, 2024 - 16:48 GMT-03:00
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 01, 2024 - 14:30 GMT-03:00
Update
We are continuing to work on a fix for this issue.
Posted Oct 01, 2024 - 10:40 GMT-03:00
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 01, 2024 - 10:15 GMT-03:00
Investigating
We are currently experiencing degraded performance on Builder.

Impact:
Timeout in Builder actions.

Update:
Our support team is actively working on it.
Posted Oct 01, 2024 - 09:50 GMT-03:00
This incident affected: Hosting Business (Bot Builder) and Hosting Enterprise (Bot Builder).