Entrepreneurs are an exceptional bunch of people. Each member of a founding team brings a vital skill to the table. Unfortunately, this doesn’t always mean that they are good at everything. Keep reading to find out how a fundamental misunderstanding of databases led to one startup’s unprecedented crisis.
We’ll call the company in question Company X. Company X is a major innovator in the field of artificial intelligence and natural language processing. It applies its unique technology to conversational commerce and is a pioneer in chat bot creation.
However, at the time of the incident, Company X was also a startup in its early days, which was not necessarily a good thing. The company had plenty of AI specialists on staff, but database expertise was not part of their lineup.
As its product gained popularity, the company's focus shifted to meeting a stream of short-term needs. To manage all of this, the company used MongoDB, deploying its entire infrastructure on a single machine with 16 GB of RAM and about 300 GB of disk. The database itself was under 150 GB, consisting mostly of decision trees stored as MongoDB documents.
Keep in mind that MongoDB was designed to handle write-heavy applications. Company X's workload, however, was almost entirely reads: the decision trees were written once and then queried constantly. Using MongoDB as a read-only store for decision trees was the company's crucial design mistake.
The result: even when their load was normal, the machine’s CPU was consistently around 80%. As traffic increased, the issue worsened, and it seemed as though a single MongoDB instance wasn’t able to handle the read query load.
The DevOps team noticed that something was amiss and realized that they had a single point of failure, one that looked to be on the brink of a major disaster. To try to salvage the situation, they upgraded the hardware to a machine with 32 GB of RAM and 8 CPU cores.
The upgrade didn't improve the situation. Amazingly, even during normal traffic hours, the CPU still registered around 80%; as the volume of simple queries grew, response times ballooned, the system stopped answering data requests, and eventually the whole thing crashed.
Pawan Tejwani is a database expert with experience in both SQL and NoSQL environments. He provided occasional consultancy services to Company X, which contacted him in the midst of the crisis.
Fortunately, the team had implemented detailed monitoring tools, so Pawan could easily spot the atypical CPU demand. Sensing that hardware was not the issue, he turned to the database for answers; more specifically, to how queries were being executed.
What he saw was a bit of a shock: the system was simply dumping data into MongoDB, with a massive impact on performance. The collections were full of bare documents carrying no indexes beyond the default one on `_id`, so every query forced MongoDB to scan an entire collection.
Pawan ran New Relic's performance-monitoring tooling, along with a bit of code analysis, and found a large volume of small queries hitting MongoDB. The solution was simple: he quickly added indexes on the queried fields. Voila! CPU load instantly dropped from 80% to under 1%.
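To see why a missing index is so expensive, consider a toy sketch in plain Python (this is not Company X's code, and the `node_id` field is a hypothetical stand-in, since the article names no schema). Without an index, every lookup compares against all N documents; a hash index on the queried field answers the same lookup in roughly constant time, which is what MongoDB's `createIndex` buys you.

```python
# Toy model of a 100,000-document collection of decision-tree nodes.
documents = [{"_id": i, "node_id": f"node-{i}", "answer": f"reply {i}"}
             for i in range(100_000)]

def find_without_index(docs, node_id):
    """Full collection scan: tests the predicate against every document."""
    comparisons = 0
    for doc in docs:
        comparisons += 1
        if doc["node_id"] == node_id:
            return doc, comparisons
    return None, comparisons

# Building the index is a one-time cost, analogous to createIndex in MongoDB.
index = {doc["node_id"]: doc for doc in documents}

def find_with_index(idx, node_id):
    """Index lookup: a single hash probe instead of up to N comparisons."""
    return idx.get(node_id)

doc, comparisons = find_without_index(documents, "node-99999")
print(comparisons)  # 100000 comparisons for one worst-case lookup
print(find_with_index(index, "node-99999") == doc)  # True: same document
```

Multiply that scan cost by a high volume of small queries and the 80% CPU figure stops being surprising.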
To reinforce the fix, Pawan continued the investigation and added a few more indexes that, given the nature of the existing queries, he expected to be used frequently. In addition, he replaced the single server with a MongoDB replica set of three machines: one primary and two secondary t2.small instances.
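The clustering step corresponds to MongoDB's replica-set feature. A minimal sketch of the server-side configuration looks like this (the replica-set name `rs0` is a placeholder; the article doesn't give the real one):

```yaml
# mongod.conf — the same setting on all three machines
# (one primary, two secondaries).
replication:
  replSetName: "rs0"
```

Once all three `mongod` processes start with this setting, `rs.initiate()` is run once from the shell, listing the three hosts; MongoDB then elects a primary, replicates data to the secondaries, and the set both removes the single point of failure and lets read traffic be spread across members via a read preference such as `secondaryPreferred`.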
After the implementation of Pawan’s solutions, overall operating costs decreased to 20% of the original figure.
Pawan explains his basic philosophy:
“Whenever you use any data store, be it SQL or not, keep its advantages and disadvantages in mind. That way, your design will be a good architectural fit and avoid these types of issues. A good design makes you choose good technologies that are right for your purpose. Remember: technology combined with good design bypasses most problems, and makes issues easier to solve when they occur.”