Performance improvement with Ruby on Rails

A project I'm leading has an existing monolith and a newer API that is partially derived from the monolith's functionality. We'd like to offer this API as a product separate from the original monolith, so we need to maintain two separate databases of user records.

This means we need some way to keep users in sync between the two systems. So, we built a fairly lightweight reconciliation process that simply makes sure certain fields stay in sync between them. The process itself isn't super important for this story, but part of the sync involves creating a new user in the API if we can't find one matching certain attributes from the monolith, roughly as sketched below.
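
For context, the find-or-create step looks roughly like this. This is a simplified sketch, not our actual code; the model names, attributes, and the SecureRandom placeholder password are all illustrative:

```ruby
# Simplified sketch of the reconciliation loop (illustrative names only).
monolith_users.find_each do |monolith_user|
  api_user = ApiUser.find_by(external_id: monolith_user.id) ||
             ApiUser.find_by(email: monolith_user.email)

  if api_user
    # Already present in the API: just make sure the shared fields match.
    api_user.update!(email: monolith_user.email, name: monolith_user.name)
  else
    # Not found: create a brand-new API user with a randomly generated password.
    ApiUser.create!(
      external_id: monolith_user.id,
      email:       monolith_user.email,
      name:        monolith_user.name,
      password:    SecureRandom.base64(45) # roughly 60 characters
    )
  end
end
```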

I kicked off the user reconciliation process late in the afternoon and noticed it was pretty slow. It took around 300ms to create a single user record, and we needed to create hundreds of thousands of them. At the rate we were going, it would take a few days of constant runtime to finish the process!

While not a huge deal, it was a good opportunity to look more deeply into the performance of our new API, which hasn't been battle-tested yet.

Another developer and I investigated several likely suspects:
  • Was it the auth token lookup process?
  • Were we missing an index on a heavily-used table?
  • Was our permission lookup really slow?
  • Was it actually the monolith lookup loop that was slow?

However, none of those yielded any results. We tinkered with little improvements here and there, but the process time was still surprisingly consistent. Never less than 300ms, and rarely more than 350ms.

So I kept digging. And boy, did I have to dig! I ended up wrapping every individual line inside the User#create controller method with a custom New Relic tracer so I could see exactly what was happening within each line.
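
The wrapping looked something like this, using New Relic's trace_execution_scoped helper. It's a trimmed-down sketch: the segment names and the generated_password helper are stand-ins, not our real code.

```ruby
class UsersController < ApplicationController
  include ::NewRelic::Agent::MethodTracer

  def create
    # Give each step its own custom segment so New Relic shows exactly
    # where the time goes inside this one action.
    trace_execution_scoped(['Custom/UsersController/create/build_user']) do
      @user = User.new(user_params)
    end

    trace_execution_scoped(['Custom/UsersController/create/set_password']) do
      @user.password = generated_password # this is the line that turned out to be slow
    end

    trace_execution_scoped(['Custom/UsersController/create/save']) do
      @user.save!
    end

    render json: @user, status: :created
  end
end
```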

I found that over 200ms of the slowness happened when setting the User's password attribute! Digging deeper, the time was going to the password digest step, and in some instances that step was even being called twice, for a total of 400ms of seemingly unavoidable slowness.
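
Here's where that digest work happens: the password writer hashes the value the moment you assign it. This is a paraphrase of Devise's DatabaseAuthenticatable module (Rails' has_secure_password does essentially the same thing on assignment):

```ruby
# Paraphrased from Devise's DatabaseAuthenticatable: the bcrypt digest is
# computed the moment the password attribute is assigned, not at save time.
def password=(new_password)
  @password = new_password
  self.encrypted_password = password_digest(@password) if @password.present?
end

def password_digest(password)
  # Ultimately calls BCrypt::Password.create(password, cost: self.class.stretches)
  Devise::Encryptor.digest(self.class, password)
end
```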

My first thought was that the randomly generated passwords for these Users were just too long. We were generating a secure password just under 60 bytes long. However, when I tested the difference between hashing a 60-byte password and a 12-byte password, it was too small to even measure in milliseconds! There was no way around this 200ms hit as long as we were setting a password at all.
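
A quick benchmark makes the point: bcrypt's running time is dominated by its cost factor, not by the length of the input. A rough sketch (absolute numbers will vary by machine):

```ruby
require 'benchmark'
require 'bcrypt'
require 'securerandom'

short_password = SecureRandom.base64(9)   # ~12 characters
long_password  = SecureRandom.base64(45)  # ~60 characters

[short_password, long_password].each do |password|
  # Average a handful of digests at the same cost factor.
  seconds = Benchmark.realtime do
    5.times { BCrypt::Password.create(password, cost: 12) }
  end
  puts "#{password.bytesize}-byte password: ~#{(seconds / 5 * 1000).round}ms per digest"
end
```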

More digging was required before I realized that bcrypt was likely our culprit, and worse... it was slow by design. I learned that bcrypt is a purposefully slow algorithm, because that slowness makes brute-forcing password hashes much harder.

You can change the number of "stretches", which essentially controls how many iterations bcrypt runs, and therefore how easy or hard the resulting password hashes are to crack.
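
In a Devise app, for example, that knob is the stretches setting in the initializer (this mirrors Devise's default generated config; with has_secure_password, the equivalent knob is bcrypt's cost):

```ruby
# config/initializers/devise.rb
Devise.setup do |config|
  # bcrypt cost factor: each increment doubles the work per digest.
  # Keep it minimal in tests so the suite stays fast; keep it high in production.
  config.stretches = Rails.env.test? ? 1 : 12
end
```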

In test environments this value is set to 1, while in production it is set to 12. The cost is exponential: each increment roughly doubles the hashing time, so the difference between a value of 12 and a value of 20 is measured in seconds, not milliseconds!
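
You can see the doubling with a tiny benchmark (a sketch; timings depend entirely on your hardware, and bcrypt-ruby bumps any cost below its minimum of 4 up to 4, which is why a test value of 1 still works):

```ruby
require 'benchmark'
require 'bcrypt'

# Each +1 to the cost factor roughly doubles the time to compute a digest.
(4..14).each do |cost|
  seconds = Benchmark.realtime { BCrypt::Password.create('not-a-real-password', cost: cost) }
  puts format('cost %2d: %8.1fms', cost, seconds * 1000)
end
```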

With all this knowledge, I came up with a novel solution allowing us to process users in 35 milliseconds as opposed to the original 200 milliseconds per user.


Was this trip to Performance Land™ completely necessary? No, but I learned a great deal about performance tracking within New Relic and how to isolate bits of code to see where latency is happening. Furthermore, I developed a better understanding of how our system holds up under pretty steady hammering from the monolith application, which is useful knowledge.

Here's a fun graph of the before (#1) and after (#2) of my changes:
#1 is the steady 300ms+ response time before my changes, and #2 is the sub-35ms latency afterwards.