Rescuing an exploding Rails App with Skylight.io
Recently, a new client approached us with a performance problem on their existing Ruby on Rails application; they were experiencing massive growth with over 50,000 new users per day signing up, and their app was receiving over 400 requests per second (and growing).
The rapidly increasing load was leading to big problems, with their existing Rails application experiencing frequent outages and causing sleepless nights for their team. They asked reinteractive to investigate and find out how we could get the app stable as fast as possible.
As time was of the essence, our first move was to install our performance tool of choice, skylight.io (referral link, with a $50 credit).
Within 15 minutes we already had enough data flowing into Skylight to use its helpful (and aptly named) Agony Index to get an idea of what was going on.
It was clear that a lot of performance could be gained by digging into the DevicesController#index action. It had a 463ms "problem" response time and typically responded in around 144ms. The key issue was that this endpoint was being hit a LOT: it was the most-hit action in the system, getting close to 400 requests per second, so any change we made to it would have a massive impact.
When we dug deeper, we learned that the main web servers were being hammered, causing load time to increase and further degrading performance. So getting this action performant would give a lot of wins across the whole system.
Opening up this action we found the following trace:
Some things jumped out immediately. There was a large spread across the response time distribution, all the way up to 400ms. This indicated that the server was taking variable amounts of time to service the request, which usually points towards a heavily loaded server without enough available capacity.
Looking at the trace, we could see that there was a SQL select query to a settings table, as well as three separate select queries to the devices table. Each of these queries was taking about 12ms to complete, so they were good targets for optimising.
Looking at the code, we found that there was a Setting model and that a Setting record was being loaded on each request to configure the app. The client confirmed that this settings data barely ever changed (about once every few months), so serving it from a cache would be totally fine. Looking at a Skylight trace, it is completely obvious that you should not make a DB request 400 times per second just to see whether any application-wide configuration happens to have changed. But this loading of settings data was buried within the application_controller, and pretty much everyone had forgotten it was being called. Seeing it bright as day in the trace got everyone's attention, and it was an easy candidate for caching right away.
This was soon done, and not only did it immediately drop the response time of the DevicesController#index action by 12ms, it also removed the SQL query from every other request landing on the app, saving 12ms on each of about 1.5 billion requests a month. This one small change saved over 400 HOURS of computing time a month! Not a bad saving from what worked out to be about 3 lines of code.
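As a rough illustration of the caching idea (not the client's actual code), the sketch below wraps a settings lookup in a small time-based cache. The names FakeSettingsStore and TtlCache, the maintenance_mode key and the 300 second expiry are all hypothetical and exist only for this example.

```ruby
# Hypothetical stand-in for the Setting lookup; it counts "database" hits.
class FakeSettingsStore
  attr_reader :query_count

  def initialize
    @query_count = 0
  end

  def load
    @query_count += 1
    { maintenance_mode: false } # assumed example data
  end
end

# Minimal time-based cache: the block is re-run only once the TTL has expired.
class TtlCache
  def initialize(ttl:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl
    @clock = clock
    @value = nil
    @expires_at = nil
    @mutex = Mutex.new
  end

  def fetch
    @mutex.synchronize do
      now = @clock.call
      if @expires_at.nil? || now >= @expires_at
        @value = yield
        @expires_at = now + @ttl
      end
      @value
    end
  end
end

store = FakeSettingsStore.new
cache = TtlCache.new(ttl: 300) # assumed expiry of five minutes

# Simulate roughly one second of traffic at 400 requests.
400.times { cache.fetch { store.load } }
puts store.query_count # prints 1
```

In a Rails app the same effect is usually achieved with the built-in cache store, for example by wrapping the lookup in Rails.cache.fetch with an expires_in option, rather than with a hand-rolled class like this one.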
Next up were the three select queries against the Devices table.
The first call was loading up the device in question by ID. As the primary key of the devices table is already covered by a unique index by default, not much could be done about this.
But the other two calls were checking for uniqueness on a validation; that is, the Device model had uniqueness validations on two attributes that looked something like this:
validates_uniqueness_of :first_attribute
validates_uniqueness_of :second_attribute
What this meant was that before the Rails app could save the Device model, it had to check that no other record in the table had the same value for either of these attributes. The only way to do that was to fire two SQL queries before any create or update, which on an update looked like:
SELECT 1 FROM devices WHERE devices.first_attribute = "?"
AND devices.id != "?"
SELECT 1 FROM devices WHERE devices.second_attribute = "?"
AND devices.id != "?"
These two queries had to query the devices table and were taking another 12ms each, or about 24ms for the pair.
As there were already unique indices on the devices table for both first_attribute and second_attribute, we were able to make a change in the Rails code to only reverify the uniqueness of an attribute if that attribute had changed. So we made the following change to the code:
validates_uniqueness_of :first_attribute,
if: :first_attribute_changed?
validates_uniqueness_of :second_attribute,
if: :second_attribute_changed?
This removed the need to check on data updates where the attributes had not changed, since if the attributes were still the same value that was retrieved from the database, they were already guaranteed to be unique.
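To make the effect concrete, here is a minimal plain-Ruby sketch of the same idea. It is not ActiveRecord and not the client's code: DeviceSketch, its simulated uniqueness check and the sample values are all hypothetical. It only shows that the lookup is skipped when the attribute still holds the value that was loaded.

```ruby
# Hypothetical plain-Ruby model of a device record; this is not ActiveRecord.
class DeviceSketch
  attr_accessor :first_attribute
  attr_reader :uniqueness_queries

  def initialize(first_attribute)
    @first_attribute = first_attribute
    @persisted_first_attribute = first_attribute # value as loaded from the database
    @uniqueness_queries = 0
  end

  # Mirrors the idea behind first_attribute_changed?: compare with the loaded value.
  def first_attribute_changed?
    @first_attribute != @persisted_first_attribute
  end

  # Runs the simulated uniqueness SELECT only when the attribute has changed.
  def save(existing_values)
    if first_attribute_changed?
      @uniqueness_queries += 1
      return false if existing_values.include?(@first_attribute)
    end
    @persisted_first_attribute = @first_attribute
    true
  end
end

taken = ["abc", "xyz"] # assumed values already present in the table
device = DeviceSketch.new("abc")

device.save(taken)             # unchanged attribute, so no uniqueness query
puts device.uniqueness_queries # prints 0

device.first_attribute = "xyz"
puts device.save(taken)        # prints false, because "xyz" is already taken
puts device.uniqueness_queries # prints 1
```

With this approach the unique indices in the database remain the final safeguard against duplicates, so skipping the application-level check on unchanged attributes does not weaken the guarantee.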
After making the above two changes, the new trace looked like this:
The new trace showed a number of things that all pointed to success.
Firstly, the clustering of the response times was much tighter, indicating that the server had more breathing room to handle the requests (this was also due in part to us helping the client boot up some more application servers). Secondly, the three select queries were no longer in the trace, saving 30-40ms per request and reducing the load on the database by about 900 queries per second.
The client app is now stable and we are helping them move to our OpsCare Service over the coming months. If you have a Ruby on Rails application that needs some performance tuning, please get in touch.