Optimizing Cosmos DB usage often saves you a lot of money.
You can reduce the yearly cost of running Cosmos DB by up to 5 figures.

Cosmos DB is a cloud database running on Azure. Like most cloud solutions, it is offered as a service, so you don’t need to manage your own instances or worry much about operations. All you need to bring to the table is your data and the applications to work with it.

However, Cosmos DB is not a traditional relational database. Microsoft markets it as a multi-model database, and it is widely known in the world of distributed databases.

Just as the schema design of your system can make or break an application, with Cosmos DB, proper attention to the distributed nature of your data is key. That means selecting a partition key that serves your needs well. A partition key that generates hotspots in your system will lead to rate limiting and failed requests.
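A quick way to reason about hotspots is to measure how much of the traffic lands on the busiest partition key value. The following is a minimal sketch, not part of any Cosmos DB tooling; the request log, the tenant names and the function `partition_skew` are all hypothetical and exist only for illustration.

```python
from collections import Counter


def partition_skew(keys):
    """Return the share of requests that hit the busiest partition key value."""
    counts = Counter(keys)
    if not counts:
        return 0.0
    return max(counts.values()) / sum(counts.values())


# Hypothetical request log: one entry per request, holding the partition
# key value that the request touched.
by_tenant = ["big-corp"] * 90 + ["shop-a"] * 5 + ["shop-b"] * 5
by_order_id = [f"order-{i}" for i in range(100)]

print(partition_skew(by_tenant))    # 0.9  (one tenant takes 90% of the load)
print(partition_skew(by_order_id))  # 0.01 (load is spread evenly)
```

A value close to 1.0 suggests that a single logical partition absorbs most of the load, which is the situation that tends to end in rate limiting.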

Consider the Database Pricing Model

Another important aspect of building good applications with Cosmos DB is to consider the pricing model. Cosmos DB charges based on provisioned throughput (how much load you can put on the server) and on how much storage space your database uses. For Cosmos DB, optimization and tuning mean:

  1. Reducing the overall monthly spend you pay Azure.
  2. Improving the performance of your application.
  3. Avoiding overstepping the limits of your provisioned throughput.

These are all intrinsically tied together. In order to reduce expenses, you need to use Cosmos DB in an optimal manner, which will allow you to reduce the amount you provisioned as well as improve your application performance.

A major problem is that Cosmos DB provisions throughput in Request Units per second (RU/s). To avoid being rate limited, you must ensure that you provision enough capacity to handle your maximum load. It is common to have an orders-of-magnitude difference between normal operations and peak load. That used to be called the Slashdot effect, and it is clearly visible on peak days like Black Friday and Cyber Monday.

You can adjust the provisioned capacity you reserve dynamically, but there are some caveats. In particular, increasing the provisioned capacity may take a while to take effect since resources need to be provisioned for this capacity. If you want to reduce the provisioned capacity, you generally cannot set it to lower than 10% of the maximum you ever provisioned.

For those reasons, dynamically updating the provisioned capacity is not something that is routinely done. Microsoft offers the option of reserving capacity, in which case Cosmos DB will manage the provisioned capacity using an autopilot mode. The issue with the autopilot mode is that it costs 1.5 times the cost of setting the capacity manually.
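To make the tradeoff concrete, here is a rough sketch that compares the two options for a given load profile. It rests on simplifying assumptions that are not from Microsoft's price list: costs are expressed in relative units (1.0 is the manual price of 100 RU/s for one hour), manual provisioning is held at the peak for every hour, and autopilot is billed each hour for the load of that hour at 1.5 times the manual rate. The load profiles are hypothetical.

```python
def manual_cost(hourly_load, rate=1.0):
    """Cost of provisioning for the peak load during every hour."""
    peak = max(hourly_load)
    return peak / 100 * rate * len(hourly_load)


def autopilot_cost(hourly_load, rate=1.0, multiplier=1.5):
    """Cost when each hour is billed for its own load at a 1.5x rate."""
    return sum(load / 100 * rate * multiplier for load in hourly_load)


# Hypothetical days: RU/s needed in each of the 24 hours.
spiky_day = [1_000] * 22 + [10_000] * 2   # short peak, quiet otherwise
steady_day = [10_000] * 24                # constantly at peak

print(manual_cost(spiky_day), autopilot_cost(spiky_day))    # 2400.0 630.0
print(manual_cost(steady_day), autopilot_cost(steady_day))  # 2400.0 3600.0
```

Under these assumptions, autopilot pays off when the load is spiky, while a workload that runs near its peak all day pays the 1.5x premium for nothing.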

Optimizing your Cosmos DB usage is an important task, which often saves you a lot of money. It is common to reduce the yearly cost of Cosmos DB by 5 figures! With proper tuning, cutting your Cosmos DB budget by 70% is not unusual.

A Much Faster Database

In addition to reducing the money you spend on Microsoft, you are also gaining a much faster system. Latency matters. An extra hundred milliseconds in rendering a page cost Amazon 1% of their sales. Google’s traffic dropped by 20% when their latency was merely half a second slower.

The cost of an unoptimized system is clearly visible in your Azure invoice, but a much larger piece of the pie is the business lost when customers drop out of your sales funnel.

Cosmos DB, like any other solution, involves a tradeoff: you need quite a bit of expertise and know-how to get it working at optimum levels. Even when you do have this expertise, it is not easy to figure out where your costs and performance have gone. I like this story about how a single-line fix (hidden very deep inside a big codebase) was able to reduce the database bill by 80%.

In this story, a dedicated team with the right expertise still needed weeks to figure out what the problem was, and only did so once it got so bad that their website kept going down. This is similar to the task developers face with manual memory management. It sounds pretty simple to start with: make sure to free every piece of memory that you allocated. If each malloc() is matched by a free(), there is no problem.

We have over half a century of experience telling us that this doesn’t work. Humans aren’t able to keep track of memory by hand. We have tools for that: either managed runtimes, which come with automatic garbage collection, or tools such as Valgrind, AddressSanitizer and similar. All of them take over in an automated fashion and free developers from having to be absolutely perfect at all times.

Monitor, Analyze, Get Recommendations and Alerts

For Cosmos DB, there is the Cosmos DB Profiler, which monitors your application’s interactions with Cosmos DB, analyzes the data and offers you insight, recommendations and alerts on an ongoing basis.

Before I get too deep into discussing the Cosmos DB Profiler, take a look at the following screenshot, showing the profiler mid-operation:

There is a lot of information on the screenshot, because the profiler is able to pull quite a lot of data from the interactions between your application and Cosmos DB. On the top left, there is the list of requests made to your application. This gives you the application specific context to understand what queries and operations the application performs as a result of a user’s action. On the top right, there is the list of queries in a particular request. This allows you to drill down into the actual operations done by your application. On the bottom left, you have overall statistics, and on the bottom right, there is the actual query that the application sent.

If that was all the profiler did, it would still be enough to give you a lot of insight into what is going on in your application. It’s your application; you want to know what is going on there, right?

Applications are built by teams over long periods of time. It is common for a developer to be familiar with just specific parts of the system. After all, there are only so many hours in the day. To understand the system as a whole, you need to be able to see things clearly. You need to see not just your specific piece, but everything happening in the system that might affect the end result.

The profiler is capable of much more than just showing you what your system is doing. You are able to see the actual queries performed by your application, and you can dig deeper:

Here, you can see the details of a single query: the time it took to execute, the request charge it consumed and the number of affected documents. We can dig deeper into the operation and pull detailed metrics about the query, as you can see below:

When working on large systems, and most applications using Cosmos DB are large, it’s easy to take an action that spawns a lot of activity. Tying that activity back to the original line of code that caused it can be a challenging task. With the Cosmos DB Profiler, it’s simple:

When you are able to look at the query and then jump directly to the code that generated it, you have a much easier time traversing your codebase. Debugging no longer entails endlessly trying to understand what is going on. You can inspect your system in real time.

There is more to what your Cosmos DB Profiler can do. One of the essential features it brings to the table is analysis. Each one of the queries that the application executes is analyzed, both individually and as a whole.

For example, let’s consider the following query:

This query fetches 487 articles from the database. That’s a pretty expensive query. Just 5 of these per second will saturate the default capacity you get from Cosmos DB.
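The arithmetic behind that statement is simple division of the provisioned throughput by the request charge of one query. The sketch below assumes a container provisioned at 400 RU/s and a charge of 80 RU per query; both numbers are hypothetical placeholders chosen for illustration, so substitute the charge that Cosmos DB actually reports for your own query.

```python
def sustainable_queries_per_second(provisioned_ru_per_sec, charge_per_query_ru):
    """How many identical queries per second fit into the provisioned RU/s."""
    if charge_per_query_ru <= 0:
        raise ValueError("charge_per_query_ru must be positive")
    return provisioned_ru_per_sec / charge_per_query_ru


# Hypothetical numbers: 400 RU/s provisioned, 80 RU charged per query.
print(sustainable_queries_per_second(400, 80))  # 5.0
```

Anything above that rate is rate limited, and every other operation in the application competes for the same budget.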

The profiler has a few things to say about this query:

  • It is not using a partition key, so it requires Cosmos DB to scan multiple partitions.
  • It is also unbounded, so it will return as many results as there are in Cosmos DB.
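Both warnings can usually be addressed in the query itself. The sketch below builds a parameterized query that filters on a single value and caps the result size with TOP. It assumes, purely for illustration, that the articles container is partitioned by an `author` property; the function name and the default limit are hypothetical as well.

```python
def build_articles_query(author, limit=20):
    """Build a bounded query that targets a single author."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return {
        # TOP bounds the result set; the filter on the (assumed) partition
        # key property lets the query target one partition.
        "query": "SELECT TOP @limit * FROM c WHERE c.author = @author",
        "parameters": [
            {"name": "@limit", "value": limit},
            {"name": "@author", "value": author},
        ],
    }


spec = build_articles_query("Jane Doe", limit=10)
print(spec["query"])
# With the azure-cosmos Python SDK, a spec like this could be passed as
# container.query_items(query=spec["query"], parameters=spec["parameters"],
#                       partition_key="Jane Doe")
```

Whether 10 or 20 is the right page size depends on the screen that shows the results; the point is that the caller picks a bound instead of receiving everything that matches.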

Digging into the query metrics, we can see how much memory this single query cost:

The Cosmos DB Profiler will identify and alert you to over a dozen suspicious behaviors in your application’s interaction with Cosmos DB. You don’t need a Cosmos DB expert to review all your database code; the profiler will do that for you. You can run the profiler in live mode, as part of a development session or as part of your Continuous Integration routine.

Let’s consider the following scenario: you have a bug in your system. The bug itself isn’t that important and it is assigned to a junior developer in the team:

Cosmos DB’s indexes are case sensitive, so if you want to perform a case insensitive match, you need to change your code like so:

This works, but at a cost. Now Cosmos DB can’t use an index for this query, turning what used to be a fast (and cheap) operation into an incredibly expensive one. Cosmos DB will now have to scan your entire dataset to return a result.

The problem is that such a change can slip through a code review quite easily, because reviews focus on functional correctness and performance, not necessarily on cost. However, it will not go unnoticed by the Cosmos DB Profiler’s code review, which catches this issue and alerts you.

In order to handle case insensitive matches, we need to introduce another field, author_lower, so the database engine can query that field directly instead of scanning through the whole collection.
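A minimal sketch of that approach follows: the lowercased copy is written alongside the original value, and lookups lowercase the search term before comparing. The helper names, the sample document and the page size are hypothetical; only the author_lower field comes from the text above.

```python
def with_author_lower(doc):
    """Return a copy of the document with a lowercased author field added."""
    enriched = dict(doc)
    enriched["author_lower"] = doc["author"].lower()
    return enriched


def author_query(author, limit=20):
    """Build a case insensitive lookup that compares against author_lower."""
    return {
        "query": "SELECT TOP @limit * FROM c WHERE c.author_lower = @author",
        "parameters": [
            {"name": "@limit", "value": limit},
            # Lowercase the search term the same way the stored field was.
            {"name": "@author", "value": author.lower()},
        ],
    }


article = with_author_lower({"id": "1", "author": "Jane DOE"})
print(article["author_lower"])                         # jane doe
print(author_query("JANE Doe")["parameters"][1]["value"])  # jane doe
```

The original author value stays untouched for display, and the extra field is a plain equality match that an index can serve. Existing documents would need a one-time update to gain the new field.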

There are a lot of such seemingly innocent pitfalls, which can slowly bleed your budget with preventable charges.

Essential for the Developer

These features are essential for the developer because they tie specific operations to the code that they work with. For the architects, there is a whole other aspect of the Cosmos DB Profiler: the analysis section. Using the analysis capabilities of the Cosmos DB Profiler, you are able to slice and dice the information along several axes.

The most interesting report you’ll want to look at is the expensive queries report, which allows you to focus immediately on problematic areas:

You can look at the data from many angles, depending on your needs. Based on the screenshot alone, an architect can run a short review session on their application and identify a lot of optimization opportunities in the system.

The first step in addressing a problem is knowing that you have one. The next step is knowing where it is.

The Cosmos DB Profiler works seamlessly with your production Cosmos DB instances as well as with local emulators on your developers’ machines. This gives you the ability to identify and address issues early in the process. You can plug it into your Continuous Integration infrastructure and write rules that validate your service level agreements. Along with testing that your application is free of bugs, you can test to make sure it is free of unexpected and unwanted cloud costs.
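The general idea of such a rule can be sketched as a budget check over recorded request charges. This is not the Cosmos DB Profiler’s API: the function, the operation names and the limits below are hypothetical, and how the charges are collected (for example, from the request charge that Cosmos DB reports with each response) is left open.

```python
def check_request_budget(operations, max_ru_per_operation, max_total_ru):
    """Return a list of budget violations; an empty list means the run passed."""
    violations = []
    total = 0.0
    for name, charge in operations:
        total += charge
        if charge > max_ru_per_operation:
            violations.append(
                f"{name}: {charge} RU exceeds the per-operation limit "
                f"of {max_ru_per_operation} RU"
            )
    if total > max_total_ru:
        violations.append(
            f"total charge exceeds the limit of {max_total_ru} RU"
        )
    return violations


# Hypothetical charges recorded while a test scenario ran.
recorded = [("load_article", 3.2), ("list_articles", 81.5)]

problems = check_request_budget(recorded, max_ru_per_operation=50, max_total_ru=100)
for problem in problems:
    print(problem)
```

A CI job could fail the build whenever the returned list is not empty, so a costly query is caught before it reaches production.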

Cosmos DB Profiler is an essential tool for teams building applications on Cosmos DB. It’s available with a 14-day free trial. Take it for a spin and see what it can do with no hassle.