There is currently an issue with a database server that stopped responding. I am waiting for a reply from the data center to know what could be causing this.

Some instances are currently down. I am trying to solve the issue as fast as possible and will update here when I know more.

Really sorry for this situation. Some instances are still down, but I can give you an update on what is happening.

One of the database servers stopped responding at 8:30 pm UTC. After the data center team intervened, they found that one disk in the RAID had failed.

All the servers I use have RAID 1, and in theory a disk failure shouldn't bring the server down. This one did and, worse, I can't get it to boot now.

1/3

So, I have asked for the faulty disk to be replaced and I am waiting for that to be done.

Worst case scenario, I will restore the database backups for the affected Mastodon servers, and there could be a loss of around 12 hours of data, since that is how long ago the last backup was taken.

2/3

I am so sorry for this situation and will do my best to try and bring the failed instances up as soon as possible. I will keep this thread updated.

3/3

OK, we are back online! :blobsweats:

The data center replaced the disk, I rebuilt the RAID, and everything is there. No data should have been lost.

I will keep checking just to be certain and will be back with more information.

I will now do a restart of all instances to clear out all processes and be sure nothing is stuck.

Expect less than 30 seconds of downtime during this process.

OK, it all looks good and everything is running smoothly.

This was by far the biggest downtime Masto.host has had. Thankfully, it was partial and only affected around 10% of the hosted services.

Still, I am really sorry that this happened. I am exhausted (it's 3:24 am for me) and tomorrow I will think about finding solutions for dealing with situations like this.

Ideally, the solution would be to add redundancy, but it's hard to do that without increasing prices. It's a hard balance.

Thank you for your patience and for making it possible for me to run this fun project.

And just as a reminder, I am leaving here the 4th paragraph of the Masto.host Terms of Service: masto.host/tos/

... I had been dreading a day like this for over 4 years.

@nattukaran mostly because all the options were bad. Either recover the backups and lose over 12 hours of data, or wait and cross my fingers that I could bring the server back up once the disk was replaced.
Thankfully, I was able to do it, but it took way more time than it should have.
The silver lining is that I learned a couple more things that will help me deal with this kind of issue faster in the future.

@mastohost Ok😀
You know what they say: experience is the best teacher.

@mastohost please don't be so hard on yourself. These things happen, it's how you respond that is the most important thing.

@duncanhart Well, in the moment it's hard for me to be clear-headed, but now, looking back, it's much easier to gain perspective. Thanks :)

@mastohost the heat and intensity of a 'crisis' blind us and we forget the wider, bigger perspectives. You're human, we all are, and the feelings you experienced are there because you care about the service.

@duncanhart Yep, pretty true. Still, was only able to sleep 3 hours because I was pretty wired. Let's see if I can chill now :)
