Multi-Tenancy in Django

Here's how we implemented multi-tenancy on a recent Python / Django app build. This article describes our approach, shares some code and compares it with an earlier project we did using a different method.

In my last post about Multi-tenancy in Elixir using Ecto I described two approaches to implementing multi-tenancy:

One solution for this would be to add a tenant_id column to every table in the database. All queries would then include this id. An alternative solution is to store each tenant’s data in a separate Postgres schema, and query against that.

That first post describes the Postgres schema approach. On a recent project we made the opposite choice, and added tenant_id to every database record. On both projects the ask was very similar, to allow other organizations to “whitelabel” the application with their own branding but with completely distinct data. In this post I’ll describe the implementation, flag some rough spots that we didn’t initially catch, and close with some comparison between the two approaches.

Django implementation

At a high level the tenant system:

  1. Adds middleware that matches an incoming request to a tenant, based on something in the URL (a different domain in this case)

  2. Stores the current_tenant_id in thread local storage.

  3. Filters every query made by a TenantAwareModel by the current tenant.

We’ll walk through each of those steps in more detail, but first some background. The project in question was implemented in Python / Django (as opposed to Elixir / Phoenix described in the previous post). This lets us (ab)use global state for storing the current tenant in a way that Elixir’s functional paradigm prevents.

Our first task was to add tenant_id to ALL THE THINGS

We did this by defining a TenantAwareModelMixin that each model in our system could inherit from, which added a relationship to a Tenant. It also overrode the default manager, which I’ll discuss in a later section.

Our Tenant model was pretty simple, with just a domain and an associated Brand object that kept track of that tenant’s colors, images, and contact info.

Middleware

We added a SetTenantMiddleware that looked approximately like this:

domain = clean_domain(request.get_host())
tenant = Tenant.objects.get(domain=domain)
set_tenant(tenant)

You could do the same thing based on subdomain or something else in the request. If you’ve got any middleware that makes database queries, you’ll want SetTenantMiddleware to come before it so that those queries are properly scoped. Otherwise, this straightforwardly puts the tenant into thread local storage.
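To make the shape of that middleware a little more concrete, here is a minimal, framework-free sketch. It follows the Django convention of a callable that wraps get_response, but several pieces are assumptions made for illustration: the TENANTS_BY_DOMAIN dict stands in for the Tenant.objects.get(domain=domain) lookup, the body of clean_domain is a guess at what such a helper might do, and FakeRequest exists only so the example runs without Django.

```python
import threading

# Thread-local storage for the current tenant.
_threadlocal = threading.local()

# Hypothetical stand-in for Tenant.objects.get(domain=...): domain -> tenant id.
TENANTS_BY_DOMAIN = {"acme.example.com": 1, "globex.example.com": 2}


def clean_domain(host):
    """Hypothetical helper: lowercase the host and drop any port."""
    return host.lower().split(":")[0]


def set_tenant(tenant_id):
    _threadlocal.tenant_id = tenant_id


class SetTenantMiddleware:
    """Django-style middleware: a callable that wraps the next handler."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        domain = clean_domain(request.get_host())
        # Raises KeyError for an unknown domain; a real app would likely 404.
        set_tenant(TENANTS_BY_DOMAIN[domain])
        return self.get_response(request)


class FakeRequest:
    """Just enough of a request object to exercise the middleware."""

    def __init__(self, host):
        self._host = host

    def get_host(self):
        return self._host


def view(request):
    # By the time the view runs, the tenant has already been set.
    return _threadlocal.tenant_id


middleware = SetTenantMiddleware(view)
result = middleware(FakeRequest("ACME.example.com:8000"))
print(result)  # 1
```

In a real project the class would be listed in the MIDDLEWARE setting ahead of any middleware that queries the database.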

What is threadlocal?

From the docs:

Thread-local data is data whose values are thread specific

This means that it’s not available to code in other threads, but is available “globally” to all code running within the thread. We came across this technique in the django-auditlog source and thought it fit our use case.
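As a quick illustration of that isolation, here is a small standard-library sketch (the names storage, value and worker are made up for the example). A value set in the main thread is simply absent in a second thread, and whatever the second thread stores never leaks back.

```python
import threading

storage = threading.local()
storage.value = "set in main thread"

seen_in_worker = []


def worker():
    # The attribute set by the main thread does not exist in this thread.
    seen_in_worker.append(hasattr(storage, "value"))
    # This assignment is only visible inside the worker thread.
    storage.value = "set in worker"


t = threading.Thread(target=worker)
t.start()
t.join()

print(seen_in_worker)  # [False]
print(storage.value)   # set in main thread
```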

At the beginning of the file (apps.whitelabel.helpers for us) you grab a reference to the storage: threadlocal = threading.local().

Then we use it like so:

def set_tenant_id(tenant_id):
    threadlocal.whitelabel = {"tenant_id": tenant_id}

Then we make it accessible:

def current_tenant():
    tenant_id = current_tenant_id()
    return Tenant.objects.get(id=tenant_id)


def current_tenant_id():
    try:
        return threadlocal.whitelabel["tenant_id"]
    except (AttributeError, KeyError):
        # Nothing has been stored in thread local for this thread yet.
        raise Exception(
            """
                Tenant is not set. Use `set_tenant` to set the tenant before running any queries:
                from apps.whitelabel.shortcuts import set_tenant
            """
        )

That’s all there is. Now the tenant is available whenever we need to make a query.

Inheritance and automatically filtering queries

The final piece that makes this work is overriding the default model Manager to use a custom TenantAwareManager that provided a TenantAwareQueryset. This article was very helpful for me in learning how the Django ORM works. It explains the design patterns the ORM implements as well as how to plug in your own custom Managers / QuerySets.

Our TenantAwareModelMixin looks like this:

class TenantAwareModelMixin(models.Model):
    """
    An abstract base class model that provides a foreign key to a tenant
    """

    tenant = models.ForeignKey("whitelabel.Tenant", models.CASCADE)

    objects = TenantAwareManager()
    unscoped = models.Manager()

    class Meta:
        abstract = True

Almost every model in the system will inherit from this, giving them all tenant_id and overriding objects to point to our custom Manager. This means that, without modifying our existing code, we could plug in and automatically filter those queries by tenant_id. For the full code of our custom manager and queryset, check out this gist.
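The gist isn't reproduced here, so what follows is only a toy, framework-free sketch of the idea and not our actual manager. It assumes an in-memory list in place of a database table, and the Row, Manager and TABLE names are invented for the example. In Django itself the equivalent hook is overriding Manager.get_queryset() and applying a filter on tenant_id there.

```python
import threading
from dataclasses import dataclass

_threadlocal = threading.local()


def set_tenant_id(tenant_id):
    _threadlocal.tenant_id = tenant_id


def current_tenant_id():
    try:
        return _threadlocal.tenant_id
    except AttributeError:
        raise RuntimeError("Tenant is not set") from None


@dataclass
class Row:
    tenant_id: int
    name: str


class Manager:
    """Unscoped access: returns every row in the table."""

    def __init__(self, table):
        self.table = table

    def get_queryset(self):
        return list(self.table)

    def all(self):
        return self.get_queryset()


class TenantAwareManager(Manager):
    """Scoped access: every query starts from the current tenant's rows."""

    def get_queryset(self):
        tenant_id = current_tenant_id()
        return [row for row in super().get_queryset() if row.tenant_id == tenant_id]


TABLE = [Row(1, "a"), Row(2, "b"), Row(1, "c")]
objects = TenantAwareManager(TABLE)
unscoped = Manager(TABLE)

set_tenant_id(1)
print([row.name for row in objects.all()])  # ['a', 'c']
print(len(unscoped.all()))                  # 3
```

Because the filtering lives in get_queryset, calling code keeps using objects exactly as before and never mentions the tenant.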

Some Rough edges and #protips

While we were happy with the basic approach, here are a few details that came up.

  • Include an unscoped manager: After setting objects to the TenantAwareManager, add unscoped = models.Manager() to give yourself access to all data without the automatic filtering. This is really helpful for any database-wide tasks you might run into.

  • Add the tenant to your error monitoring: For this project we used Sentry, but the principle should apply regardless of what tool you choose. In our set_tenant function, where we store the tenant in thread local, we also set a Sentry “tag” with the current tenant. This is immensely helpful for debugging / reproducing errors.

  • Background jobs need a tenant passed into them: Our system has a few Celery tasks that execute in the background. Since each job runs in a different thread than the request that set the tenant, it needs to be enqueued with a tenant_id passed in. Our approach was to change each task to queue up a new subtask for each tenant. That change looked roughly like:

Before:

def do_some_task(kwargs):
  work()

After:

def do_some_task(kwargs):
  for tenant in Tenant.objects.all():
    tenant_do_some_task.delay(tenant_id=tenant.id)

def tenant_do_some_task(tenant_id=None):
  set_tenant(tenant_id)
  work()

This was easy enough, but introduced a subtle bug, which the next tip solved.

  • Add a tenant context manager: A context manager is a Python technique for defining a “runtime context”. It lets you allocate resources at the beginning of the context and free them at the end. It’s typically used with the with statement like so:
with open("filename.txt", "w") as opened_file:
  opened_file.write("Hello World")

At the end of that block, the file will be closed automatically by its context manager.

We added a tenant context manager [gist] so we could set and clean up the contents of thread local explicitly. The motivation came from our Celery tasks. The Celery workers are a pool of threads which perform a queued up task and then go back to waiting for more work. Prior to the context manager, this meant that if the previous task had called set_tenant, its value would still be in thread local when the next task ran on the same thread. This was fine if the first thing that next task did was also call set_tenant, but we had a few tasks which were not tenant specific. Having that tenant value lingering caused some subtle bugs, which the context manager fixed right up. Now all of our tenant tasks look like:

def tenant_do_some_task(tenant_id=None):
  with tenant(tenant_id):
    work()
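For readers without the gist to hand, here is a minimal sketch of what such a context manager could look like, built on contextlib from the standard library. It is an assumption about the shape, not a copy of our code. To keep it short, current_tenant_id returns None when nothing is set instead of raising, and a single-worker ThreadPoolExecutor stands in for a Celery worker that reuses its thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

_threadlocal = threading.local()


def current_tenant_id():
    # Simplification for the sketch: None means "no tenant set".
    return getattr(_threadlocal, "tenant_id", None)


@contextmanager
def tenant(tenant_id):
    """Set the tenant for the duration of the block, then restore the old value."""
    previous = current_tenant_id()
    _threadlocal.tenant_id = tenant_id
    try:
        yield
    finally:
        # Runs even if the block raises, so nothing lingers on a reused thread.
        _threadlocal.tenant_id = previous


def tenant_task(tenant_id):
    with tenant(tenant_id):
        return current_tenant_id()


def global_task():
    # Not tenant specific: should never see a tenant left by an earlier task.
    return current_tenant_id()


# One worker thread, so both tasks are guaranteed to run on the same thread.
with ThreadPoolExecutor(max_workers=1) as pool:
    first = pool.submit(tenant_task, 42).result()
    second = pool.submit(global_task).result()

print(first, second)  # 42 None
```

Restoring the previous value, rather than always clearing it, also lets the blocks nest safely.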

Comparison

Having implemented both the Postgres schema and a row-based approach we can now compare them. While we implemented them in different languages and web frameworks, either approach could have been used in either case. Both of these solutions worked well, but there are tradeoffs between them.

Data Isolation

The Postgres schemas offer better data isolation. Without a tenant set, the schema-based implementation cannot make a query. It would generate some query like SELECT * FROM null.tablename.columnname and the database would appropriately error out. In the row-based approach, querying without appending WHERE tenant_id = my_tenant_id would give you back the whole world. Now, we haven’t actually had any cross-tenant data leaks. We’ve put guardrails into the code so it throws exceptions when a tenant isn’t set. But, that required extra work and is not as bulletproof as the schemas.

Global State

This Python implementation relies on (cue spooky music) globally shared mutable state! This is the big bad that functional programming protects us from. And I agree that it’s a bad practice generally! But in this case, I think it allowed for a very clean implementation. We’re only setting the tenant once per request, and it’s not actually global: it’s isolated to the thread. I think a similar thing could be done in Elixir, passing the tenant around between processes in the same way we did here for the Celery jobs.

Tenant Ignorance

One aspect that was very important to me was isolating the multi-tenancy code from the regular work a developer on this platform would need to do. The Python implementation was almost invisible for regular work. Programmers joining this project don’t need to know much about how multi-tenancy works; they just need to make sure they grab the right mixin when making a new class. Background jobs are the only part of the system that requires you to understand a little bit about the details. This is much less burdensome than our Elixir version, which required passing the specially constructed Repo into every function in the application.

Schema Migrations

For this Django approach, after the initial migration to add a tenant_id everywhere, subsequent migrations were all regular, standard Django migrations. In the Postgres schemas version, we needed to introduce “runtime migrations”. Each time a tenant was created, they needed to have a new Postgres schema allocated for them and set up with the app’s current data schema. Then, any update to the data schema requires looping over all the tenant schemas. Whew! I’m getting confused just writing out the word schema that many times, but the conclusion is that the Postgres schema version introduced more complexity around changes to the structure of your data.

Conclusion

It’s not often that we get a chance for a do over on our technical choices. I really enjoyed this opportunity to try solving the same problem with two totally different approaches, in two totally different tech stacks. In the end, I think both solutions have their place. Hopefully these two articles can help you decide which is best for your situation.

Acknowledgements

All of this code was written in close collaboration with Shaan Savaariyan, with help from Joe Jackson (particularly the context manager) and feedback from the rest of the dev team.

Dylan Lederle-Ensign

Dylan is a developer who leans on rigorous academic training to solve practical problems for real users and organizations. He works in our Falls Church, VA, office.
