1. Metrics Are A Must
Every website should have at least some basic measurements in place around site traffic and response times. You may also want to try wiring up measurements on HTTP response codes (think 404s or 500s per minute) to track the general health of your website. Having additional metrics on system resources can be a lifesaver if you need to correlate performance issues across different systems. This can also help you find bottlenecks that you might not otherwise run into except during heavy load.
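If you want a feel for what "500s per minute" looks like before picking a tool, here is a minimal Python sketch. It assumes you have already parsed your access log into (timestamp, status) pairs; the function name and the sample data are made up for illustration, and a real monitoring agent would do this for you.

```python
from collections import Counter
from datetime import datetime


def error_counts_per_minute(records):
    """Count 4xx/5xx responses per (minute, status code).

    `records` is an iterable of (datetime, int) pairs, e.g. parsed
    from an access log. Successful responses are ignored.
    """
    counts = Counter()
    for timestamp, status in records:
        if status < 400:
            continue
        # Truncate the timestamp to the start of its minute.
        minute = timestamp.replace(second=0, microsecond=0)
        counts[(minute, status)] += 1
    return counts


# Hypothetical sample data standing in for parsed log lines.
sample = [
    (datetime(2023, 11, 24, 9, 0, 5), 200),
    (datetime(2023, 11, 24, 9, 0, 40), 404),
    (datetime(2023, 11, 24, 9, 0, 59), 500),
    (datetime(2023, 11, 24, 9, 1, 2), 500),
]

for (minute, status), n in sorted(error_counts_per_minute(sample).items()):
    print(minute.isoformat(), status, n)
```

Graphing those counts next to request volume is usually enough to tell "more traffic" apart from "something is broken".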
Think your monitoring setup is already up to speed? Now is the time to work on creating some holiday-specific dashboards based around critical and/or heavily used graphs; after all, no one wants to go digging through last year's pinned links in a Slack channel to see what's going on. Once you have a few dashboards set up, spend some time walking through them with folks one step removed from your team, so that come crunch time people know where to look to get the information easily.
Concerned you don't have time to set up a bunch of monitoring infrastructure? Lots of monitoring-as-a-service companies have near turn-key solutions that can get you up and running quickly. Many offer free trials for a limited time; better yet, use something like Circonus, which offers free accounts with a cap on the number of metrics you can track. This gives you plenty of room to get started, allows you to keep your data for analysis after the holidays, and won't require a bunch of extra hardware on your end.
2. Sprinkle On The Caching
Ah, caching: often touted as a panacea for performance issues, it can cause as much trouble as it solves. One of the big problems with adding caching to a web application is that it can change the logical workflow for how customers see and interact with data. This can create cases where you have to work out cache invalidation strategies that aren't always obvious. That said, there are a couple of cases where you can add caching without having to make significant modifications to your internal app.
The first idea is to simply cache your initial landing page and perhaps nothing more. Depending on your traffic patterns, oftentimes the primary landing page is the item that creates a bottleneck on your site, as people load up this page trying to determine where they need to click next. Even caching this page with a small timeout (1 minute) can make a huge difference in the number of requests and ultimately the amount of resource pressure on your back-end infrastructure. Worried that your landing page includes dynamic content and customizations? Perhaps consider creating a new, static landing page, focused on the most sought-after information for the new or occasional users who are now visiting the site and driving up traffic. Adding a splash page like this in front of your normal page isn't something you want to do in the long term, but for a few days during the holidays, it can buy you a lot of breathing room.
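To make the one-minute timeout idea concrete, here is a rough Python sketch of a single-page cache. The class and function names are hypothetical, and in practice you would more likely let a proxy or your framework's cache do this; the point is only to show how little logic is involved.

```python
import time


class TTLPageCache:
    """Cache one rendered page and re-render it only after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the cache can be tested
        self._page = None
        self._expires_at = None

    def get(self, render):
        now = self.clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: do the expensive render once.
            self._page = render()
            self._expires_at = now + self.ttl
        return self._page


def render_landing_page():
    # Placeholder for the expensive work (database queries, templates).
    return "<html><body>Welcome!</body></html>"


landing_cache = TTLPageCache(ttl_seconds=60)
print(landing_cache.get(render_landing_page))
```

With this in place, the expensive render runs at most about once a minute per process instead of once per request. Note that this sketch is not thread-safe; a multi-threaded server would need a lock around the refresh.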
Another idea that can also reap good rewards is setting up a reverse proxy in front of your normal web setup. Depending on your setup, you can use this technique to offload specific items from the application server. Look for work that doesn't require specific state information from your application. Static assets like images, CSS, JavaScript files, or even statically rendered pages are an obvious first choice: serve these items with long-lived cache headers, once again reducing the number of requests your systems have to handle.
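As a rough illustration of "long-lived cache headers", the Python sketch below picks a Cache-Control value based on the file extension. The extension list, the 30-day lifetime, and the function name are assumptions chosen for the example, and the path is assumed to have no query string; a reverse proxy or CDN would normally express the same rule in its own configuration.

```python
from pathlib import PurePosixPath

# Assumed set of extensions that are safe to cache for a long time.
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".woff2"}

THIRTY_DAYS = 30 * 24 * 60 * 60  # 2592000 seconds


def cache_control_for(path, static_max_age=THIRTY_DAYS):
    """Return a Cache-Control header value for a request path."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in STATIC_EXTENSIONS:
        # Any cache (browser, proxy, CDN) may store and reuse this response.
        return f"public, max-age={static_max_age}"
    # Everything else must be revalidated with the origin before reuse.
    return "no-cache"


for path in ("/static/site.css", "/img/logo.png", "/cart"):
    print(path, "->", cache_control_for(path))
```

Long lifetimes work best when asset file names change whenever their content does (for example, a version or hash in the name), so that a deploy never leaves visitors with stale files.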
In both of these cases, you can generally set up a minimal solution using tools like Traffic Server or Nginx, or look to outside CDN services like Fastly or CloudFront for an initial setup which won't require running your own infrastructure to get started.
3. Have A Scaling Plan
Another idea you should implement is to go through a scaling exercise with your operations team, so you can walk through each component of your system, and understand which parts can be scaled, and what is necessary to scale them. There is nothing worse than thinking you can quickly spin up new web servers only to realize that someone added a new component that is disk bound and now cannot be horizontally scaled without engineering effort to set up data replication between servers. So what you thought would take minutes now takes an hour just to ship the data between servers. I've heard enough people say "I can't believe we did something like that" to know that it's worth at least talking through the idea, and even better to run through it at least once to prove it works.
At a minimum, you want to come up with a run list for scaling options so that decisions are made in advance rather than having to be figured out under pressure. Sure, maybe you can reboot your database server onto a more powerful VM size, but how long does that take, and, more importantly, can you afford the downtime for that? Maybe you should spin up a more powerful secondary ahead of time, and then fail over if that becomes necessary. Again, for each component, consider whether you can add more hardware (virtual or physical, scaling the system vertically) or add copies (scaling it horizontally) and decide which scenarios require which reaction. This can also help you understand where your limits are, so you don't spend your time on efforts that cannot bear any fruit.
4. Ain't No Party Like A 3rd Party Service
Everyone loves 3rd party services; they add functionality while making many parts of the service somebody else's problem. Until it becomes your problem. You might be surprised by the number of 3rd party services that don't do adequate load testing, but you shouldn't be; many of the companies running these services are just like yours: under pressure to add new features, with limited resources and budget (aren't we all?). Even Amazon has had cascading failures due to different components going down; trust me, it can happen to anyone.
What this means is you need to be aware of how your site will behave when various 3rd party services begin to fail. Better still is if you can come up with ways to mitigate those failures. Unfortunately, in many cases, this can require code changes. But before you start down the path of re-implementing a given service, try to determine if you can introduce a method for simply turning the service off. Doing this right means isolating the 3rd party code and wrapping it in conditionals so that you can easily flip a flag or add a few variables and have the service temporarily turned off. You don't want to be rewriting code like this under pressure, but being able to push out simple config changes can be a lifesaver.
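Here is one way the "flip a flag" idea might look in Python. Everything named here is hypothetical: the RECOMMENDATIONS_ENABLED environment variable, the recommendations widget, and the `client` callable standing in for a vendor SDK. The sketch also assumes an empty list is an acceptable fallback for your page.

```python
import os

OFF_VALUES = {"0", "off", "false", "no"}


def flag_enabled(name, env=os.environ):
    """Treat a flag as on unless it is explicitly set to an 'off' value."""
    return env.get(name, "on").strip().lower() not in OFF_VALUES


def fetch_recommendations(user_id, client, env=os.environ):
    """Call the 3rd party service, or return an empty fallback.

    `client` is any callable taking a user id (a stand-in for the
    vendor SDK call); isolating it here keeps the kill switch in one place.
    """
    if not flag_enabled("RECOMMENDATIONS_ENABLED", env):
        return []  # service switched off via config: skip the call entirely
    try:
        return client(user_id)
    except Exception:
        # The service is failing: degrade gracefully instead of erroring out.
        return []


def fake_client(user_id):
    return ["gift-card", "wrapping-paper"]


print(fetch_recommendations(42, fake_client, env={}))
print(fetch_recommendations(42, fake_client, env={"RECOMMENDATIONS_ENABLED": "off"}))
```

Because the flag lives in configuration rather than code, turning the service off during an incident becomes a config push instead of an emergency deploy.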
5. The Holidays Are About People
One last suggestion that is always worth remembering is that your website is quite possibly not the only thing adding to the stress of your team this holiday season. Holiday schedules are extra busy, with concerts, relatives visiting, shopping lists to deal with, and who knows what else. Stress levels are often high this time of year, so remember to look out for each other. If your people need flex time or time away during non-emergencies, make sure to advocate for that as much as you can. This also makes the next few weeks a good time to review everyone's ability to work remotely and make sure folks understand where and when communication around issues will take place. Sure, it sucks to have to work on a holiday, but it's far worse if you have to do it from the office.
And remember, if you do find yourself dealing with production issues, keep an eye on how much time people spend firefighting and try to rotate people in and out when you can. When trying to solve complex failures, you need fresh minds when you can get them. People need to take breaks, but in a high-stress environment adrenaline kicks in and they may not realize when they have been at it too long. In most cases, service problems seen during the holidays can recur if you aren't prepared (high traffic from Black Friday events will eventually go down, but can often re-manifest over the weekend or again on Cyber Monday), so it's up to you to make sure the team is ready for that.
In Conclusion
There is never enough time to do All The Things™, but if you stay focused, there is still time to make headway. Despite all the hype, it is possible to be successful without rewriting everything into a cloud-based microservices DevOps infrastructure™. But time is running out; you need to stay focused and make sure that what you are prioritizing and working on is intentional, focusing on any immediate concerns and doing as much to prepare as you can in the next couple of weeks. It is those ounces of prevention now that save you from the pounding a cure will require later.
If you have questions or are looking for help preparing for (or recovering from) the holidays, we’re here to help. Let us know if we can assist you this busy season.