Server Down for 4.5 Hours - 2004-06-17

Marc

Folks,

About 2.30 PM today I received an e-mail from Server Matrix that the server was down - they continuously monitor the IP and report failure to me via e-mail. Unfortunately I was at a client facility and didn't get the e-mail immediately. I saw it (and a bunch of other e-mails) when I got back about 6.45 PM.
The Planet Support Ticket 288957PLNT was Created

Please go to xxxxx to view and/or edit the support ticket.

Ticket Details:
Queue:Technical Support
Status:OPEN
Summary:OUTAGE: CaymanSystems01 (69.93.111.34)
Details:Our monitoring system is currently showing PING Critical for CaymanSystems01 (69.93.111.34).
Please let us know how you would like to proceed.
Thank you.

Setting to close in 48 hours.
Resolution:

***** The Planet Monitoring System: IPAlert *****

Notification Type: ACKNOWLEDGEMENT

The following host appears to be: DOWN

Host: CaymanSystems01
Address: 69.93.111.34

Date/Time: Thu Jun 17 13:30:44 CDT 2004

Additional Info: 288957

Please do not reply to this email. If you have additional information, please update the ticket using the link to the support system provided. If the support ticket was closed, and the issue was not resolved to your satisfaction, please feel free to open another ticket and reference this ticket number.

Thank you,
The Planet
Yes - I did seriously panic... Server Matrix, who I lease the server and connection from, answered my telephone call almost immediately and had everything fixed literally within minutes. I have a RAID server so everything was mirrored on a second drive.

The delay was because I can't be here 24/7/365 watching and I can't afford to pay someone to. I want to thank Server Matrix for their very prompt fix. I also set up an escalation procedure with them: if a problem is within their scope, they can now go ahead and fix it without waiting for my OK. In this case they knew there was a problem, and even e-mailed me about it, but could not technically look at the server without my permission.
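
For anyone who wants the before and after of that change spelled out, here is a minimal sketch in Python. The names (`Outage`, `next_step`, and the two action strings) are made up for illustration; this is not Server Matrix's actual process or tooling, just the decision as described above.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """A host that monitoring has reported as down (hypothetical structure)."""
    host: str
    within_provider_scope: bool  # True if the provider can fix it themselves


def next_step(outage: Outage, preauthorized: bool) -> str:
    """Return what the provider does once monitoring reports the host down.

    "fix_and_notify"  - act right away, then tell the customer
    "notify_and_wait" - e-mail the customer and wait for permission
    """
    if outage.within_provider_scope and preauthorized:
        return "fix_and_notify"
    return "notify_and_wait"


outage = Outage(host="CaymanSystems01", within_provider_scope=True)

# Old arrangement: no standing permission, so they had to wait for me.
print(next_step(outage, preauthorized=False))

# New arrangement: standing permission for anything within their scope.
print(next_step(outage, preauthorized=True))
```

The only thing that changed is the `preauthorized` flag; anything outside their scope still waits for me.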

I apologise for the interruption in access.

I have verified the database and although there were a number of open tables, I was able to repair them without incident so we're essentially starting where we left off - I do not believe there was any data loss.
 
No - But I'll ask you to write down my phone number and call me if you can't reach the server in the future.

EDIT - ADD

If any of you folks want to earn an Elsmar Cove Happy Campers merit badge, particularly those of you in the US, write down my telephone number and by all means give me a call if you can't reach the server.

On the other hand, as most of you know, the site has not been down due to a technical issue in a very long time. I don't even remember it being offline due to a technical issue for more than an hour or two, so I don't see this downtime as an issue. Yes - I've revised my FMEA and done a CA... PA is giving SM permission to act without first obtaining my permission and asking you folks to telephone me if you can't connect.
 
Marc said:
No - But I'll ask you to write down my phone number and call me if you can't reach the server in the future.

EDIT - ADD

If any of you folks want to earn an Elsmar Cove Happy Campers merit badge, particularly those of you in the US, write down my telephone number and by all means give me a call if you can't reach the server.

On the other hand, as most of you know, the site has not been down due to a technical issue in a very long time. I don't even remember it being offline due to a technical issue for more than an hour or two, so I don't see this downtime as an issue. Yes - I've revised my FMEA and done a CA... PA is giving SM permission to act without first obtaining my permission and asking you folks to telephone me if you can't connect.
I noticed it was down, but, self-absorbed in a presentation I was giving this evening, I just shrugged it off - to deal with at another time.

I copied your phone number and sent it to myself via email to tuck in a special Cove folder. I'll keep it in mind if anything similar should occur.

Considering the big hullabaloo over Akamai Technologies Inc. and Yahoo being hit by DoS (Denial of Service) attacks, I assumed you were caught in something similar. Yahoo was having a similar problem at just about the same time you were today. Some other Akamai Technologies Inc. customers just shut down until they could sort things out.

P.S. Your apology was welcome and unique - in my memory, most web sites do not even acknowledge they were ever down.
 
As far as I know there was no DoS attack or other deliberate 'hack' attempt. However, I haven't reviewed the Apache log files. I'll get around to that but I doubt that a DoS or DDoS was the problem.

When I called the Server Matrix folks the gal told me there was a video hardware problem (no - I have no idea what the video aspect has to do with it as it's a rack server) so they can't 'see what happened'. She did a HARD RESET, soft shutdown and soft reboot. I ran a REPAIR program on the database and it indicated all errors were 'fixed' (a lot of faith, there...).

Wes, I appreciate it and I'm sure IF the server goes down again I'll get at least 1 phone call! This is so rare that I really don't anticipate a recurrence, but as I said - IF you (or anyone) can't reach the site, by all means - give me a call and let me know. In this case I could have had the site back online within minutes instead of hours.

As a general historical FYI - I lease a server from a specialist firm (https://servermatrix.com/) that operates server farms, so that the server is on a 'backbone' and has redundant services. Every user is connecting to a 'backbone' so THEIR connection is the 'weakest link' below 10x ethernet - which is pretty darn fast for loading a web page. The site is on a dedicated server - no processor sharing or any of that stuff which, as some folks may remember, was a serious problem here a year or more ago. I didn't even want more visitors back then because there were so many issues, including the disk space aspects.

On this server I have about 65% free space and the processor load is typically below 0.6% - which is very good. Your connection may vary, but it's typically a router issue vs. a server issue.
 
Marc said:
Server Matrix, who I lease the server and connection from, answered my telephone call almost immediately and had everything fixed literally within minutes. I have a RAID server so everything was mirrored on a second drive.
Marc said:
I also set up an escalation procedure with them where they do not need me to OK their going ahead and fixing a problem, if it is within their scope, without my permission.
Yes, I noted that the site was down and of course I did suffer a bit from Cove withdrawal symptoms, but really: Who could ask for more than the above? Outstanding.
:applause:

/Claes
 