Degradation BLiP

Incident Report for Blip

Postmortem

**Dear Customers,**

We deeply regret the failure that occurred on the platform, which in turn affected our customers.

Details of how the incident was handled, up to the restoration of the environment:

Problem: Slowness on the platform after a scale up of the transactional database (the database that stores logs of messages, notifications, and commands sent).

Impact:

All components of the portal and Desk, and the usability associated with them, were degraded.

Analysis: The Scale Up process was found to be causing a high impact on customers, so it had to be canceled. The Scale Up process was canceled at 17:47. After the cancellation, a TDS error was observed (TDS is the SQL Server communication protocol).

Root cause: An abnormal increase in traffic, which increased the response time of the database that holds the context of the bots' flows.

Correction applied: The service was fully restored after the database Scale Up, which started at 21:00 on 23/03/2020 and finished at around 01:35 on 24/03/2020.

Earlier on 23/03/2020, several maneuvers were carried out, as detailed above, to minimize the slowness customers were experiencing, since a Scale Up of the database during operating hours would cause an even greater impact. At around 17:01, after all other attempts had been exhausted, the technical team and leadership decided to perform the Scale Up. However, once the process started, a major impact on production was observed, and the process had to be canceled at around 17:45.

After the cancellation, an unexpected error on the cloud provider's side was observed, which led the technical team to regroup. The team concluded that the error was still causing a high impact on customers, which made a second Scale Up attempt necessary; it was also unsuccessful. However, when the process was canceled this second time, the error no longer occurred. With the environment still slow but stable, the technical team decided to start the Scale Up during the period of lowest traffic. Internal incident 55634 was also opened to map improvements.

Degradation start: 23/03/2020 11:40

End of instability: 24/03/2020 01:35
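As a rough check on the timeline, the total degradation window can be computed from the two timestamps above. This is a minimal sketch: the variable names are illustrative, and both timestamps are assumed to be in the same local time zone (GMT-03:00, as on the status posts).

```python
from datetime import datetime

# Timestamps taken from the postmortem, in DD/MM/YYYY HH:MM format.
FMT = "%d/%m/%Y %H:%M"
degradation_start = datetime.strptime("23/03/2020 11:40", FMT)
instability_end = datetime.strptime("24/03/2020 01:35", FMT)

# Split the elapsed time into whole hours and remaining minutes.
window = instability_end - degradation_start
hours, remainder = divmod(int(window.total_seconds()), 3600)
minutes = remainder // 60

print(f"Degradation window: {hours}h {minutes:02d}min")
```

Under these assumptions the window is just under 14 hours, of which roughly the last four and a half were the overnight Scale Up itself.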

Posted Apr 01, 2020 - 11:08 GMT-03:00

Resolved

Problem:

Slowness on the platform, affecting message interaction. The slowness resulted from a 50% increase in platform traffic compared to the previous 10 days.
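To illustrate how a figure like this can be derived, the sketch below compares one day's traffic against the average of the preceding 10 days. The message counts are made-up placeholder values, not data from the incident, and `percent_increase` is a hypothetical helper; the report does not say how the 50% figure was actually computed.

```python
from statistics import mean


def percent_increase(baseline_days, current):
    """Return the percentage change of `current` relative to the mean of `baseline_days`."""
    baseline = mean(baseline_days)
    return (current - baseline) / baseline * 100


# Placeholder daily message counts (illustrative only, not incident data).
last_10_days = [100_000] * 10
incident_day = 150_000

increase = percent_increase(last_10_days, incident_day)
print(f"Traffic increase vs. 10-day average: {increase:.0f}%")
```

A mean is used as the baseline here for simplicity; a median or a same-weekday comparison would be equally reasonable choices.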

Actions taken:

Throughout yesterday, 03/23, we carried out several maneuvers to try to mitigate the slowness on the platform, which mainly affected message interaction.

All of these actions were intended to avoid a ScaleUp during operating hours: because of the high volume of transactions on the database in question, the procedure could cause unavailability with a greater impact on customers' business.

The ScaleUp of the database started at 9:00 pm and ended at 1:30 am.

We monitored the platform throughout the morning and saw no recurrence of the problem.

Posted Mar 24, 2020 - 15:29 GMT-03:00

Monitoring

The ScaleUp of the database was processed during the night.
Processing was completed around 1:30 am.
We continue to monitor the platform.

Posted Mar 24, 2020 - 06:58 GMT-03:00

Investigating

Problem: Slowness on the platform with an impact on message interaction. This slowness was due to a significant increase in traffic flow compared to the last 10 days.

Actions taken: Throughout the day we carried out several maneuvers to try to mitigate the slowness on the platform, which is affecting message interaction.
All of these actions were intended to avoid a ScaleUp during operating hours: because of the high volume of transactions on the database in question, the procedure could cause unavailability with a greater impact on customers' business.

Actions in progress: ScaleUp in the database
Start date/time: It will start at 9 pm today (03/23/2020)
End forecast: estimated duration of 13 hours; this may be shorter given the lower volume of traffic during the night.

Posted Mar 23, 2020 - 20:43 GMT-03:00

Update

The technical team is applying a configuration to some servers in order to resolve the slow points.

The forecast for completion is 5 PM.

Posted Mar 23, 2020 - 16:33 GMT-03:00

Update

We are continuing to investigate this issue.

Some actions have already been taken to resolve the problem as mentioned in the last update.

Posted Mar 23, 2020 - 14:52 GMT-03:00

Update

The technical team continues to investigate the case. Some maneuvers have already been carried out to solve the problem, as mentioned in the last update. We saw an improvement, but there are still slow points in the exchange of messages.

Posted Mar 23, 2020 - 14:49 GMT-03:00

Identified

The impact is occurring on the Enterprise cluster.

Actions:

A new server was added in production. We are monitoring the performance of message processing.

Scale up was performed on the Transactional Database from 16 vCores to 24 vCores.

Posted Mar 23, 2020 - 12:39 GMT-03:00

Update

We are continuing to investigate this issue.

Posted Mar 23, 2020 - 11:49 GMT-03:00

Investigating

We identified slowness on the BLiP platform. The problem is being analyzed with priority, and we will post an update as soon as we have a new status.

Posted Mar 23, 2020 - 11:48 GMT-03:00

This incident affected: Hosting Business (Bot Builder, Bot Router) and Hosting Enterprise (Bot Builder, Bot Router).