
To Our Valued Customers:
As you may be aware, on Wednesday, November 18, 2015, at approximately 2:00pm (14:00) New York time, we shut down the servers that power your store as part of a massive upgrade to our infrastructure facility. As communicated, we expected this outage to last between eight and twelve hours. It has now been 24 hours, and while our team has been working nonstop to resolve the issue and communicate updates as best we can, I would like to provide further clarity on:
- Why this upgrade was unavoidable.
- The sequence of events and root cause of the outage.
- What we are doing to resolve it and get all of your stores back online.
- What we are doing to prevent this from happening in the future.
First, Zoey has expanded significantly in the 16 months we have been in operation and we have all of you to thank for that. The infrastructure upgrade, while still underway, has come at a critical point where we must make significant changes to better serve you and keep your online businesses running reliably. This means upgrading the internet connectivity that powers your store, better preparing for disaster recovery by upgrading the power that is fed to our data center, supplying redundant power to our servers, implementing redundant cooling, and providing better physical security to our infrastructure.
To execute this task, our infrastructure had to be disconnected completely and moved into its new upgraded facilities. This move was meticulously planned and we contracted industry-leading professionals to perform the tasks. There were many inherent risks, which we identified and did our best to mitigate. We explored all options to virtually migrate all stores in an attempt to keep them online during this upgrade process, or to keep pieces of the old infrastructure operational while other parts were migrated. After much consultation, we found that all other scenarios carried much higher calculated risk and would still have resulted in significant downtime. In summary, I would like to assure you that we chose the best and most cautious plan of action.
Here is the sequence of events (New York time):
- 2pm Wednesday: Began powering down and preparing our infrastructure for the upgrade and move.
- 9pm Wednesday: Physical portion of the move completed.
- 11pm Wednesday: All power and cooling systems were in place and we began to power up the networking, storage and server hardware. At this time we identified and began working with our vendors on a problem with our new internet connection.
- 2am Thursday: Successfully resolved the internet connectivity issue and moved on to booting up our stores. At this time we identified an issue with our storage system and began working with our partners to resolve it.
- 8am Thursday: Successfully resolved the issue with our storage system and continued to start up our stores.
- 10am Thursday: Spun up our first VM and began testing the routing of internet traffic. At this time we identified an issue with our switches.
- 12pm Thursday: Continued working with our networking hardware provider and identified the issue: the configuration for our switches, which routes all internet traffic to the correct store, had been wiped during the physical move. We began working to restore it from backups taken before the move.
- 3pm Thursday: Work with our vendors continues in an effort to resolve what is the final piece before we are able to successfully route internet traffic to our stores.
What we are doing now to resolve this issue:
We have contracted additional outside professionals to help troubleshoot and resolve this issue. These experts will be on site and will begin work at 5pm New York time tonight. We are also working closely under the guidance of our vendors to take all steps necessary to fix the problem as quickly as possible. While we have no estimated time to resolution (ETR) to share, there is an identified resolution path and our team is continuing to work nonstop to achieve it.
I would like to further clarify that during this process, all store data has been retained and that the root cause lies in the routing of internet traffic through our infrastructure to the correct store. A good analogy: the cities are set up, but the roads leading to those cities are inaccessible. Once we are able to restore the routing, we will have completed this upgrade.
Many of you have asked what we are doing to prevent outages from happening again, and I strongly feel that you deserve a good answer. The answer is at the core of our reasoning behind this current project. Ensuring reliable service requires redundancy in internet connectivity, power, cooling, backups and security. These provisions and more are now available to us in our new facilities. Ecommerce is a 24-hour business. We recognize and respect the importance of keeping your online business running and will continue to work with industry-leading partners and vendors to provide the best service possible.
We will continue to post updates at http://status.zoeycommerce.com. Phone support and ticketing may be delayed due to the volume of inquiries.
I’d like to add that I have received each and every one of your emails. I am personally in our new data center working to resolve the outage and will respond to each and every one of you once this upgrade is completed.
I would again like to sincerely apologize for the extended outage and assure you that we will continue to work as quickly and safely as possible to complete the work and restore global connectivity.
Thank you for being a valued Zoey customer. You can expect additional emails from me in the coming days with any further details.
Thank you,
Uri & The Zoey Team