HackerEarth

author’s Articles

Insights & Stories by Guest Author

Whether you're building your first team or scaling culture across regions, Guest Author's articles offer human-first insights rooted in real practice.

Tech Tutorials

Scaling database with Django and HAProxy

MySQL – Primary data store

At HackerEarth, we use MySQL database as the primary data store. We have experimented with a few NoSQL databases on the way, but the results have been largely unsatisfactory. The distributed databases like MongoDB or CouchDB aren't very scalable or stable. Right now, our status monitoring services use RethinkDB for storing the data in JSON format. That's all about using NoSQL database for now.

With the growing amount of data and number of requests every second, it turns out that the database becomes a major bottleneck to scale the application dynamically. At this point if you are thinking that there are mythical (cloud) providers who can handle the growing need of your application, you couldn't be more wrong. To make matters worse, you can't spin a new database whenever you want to just like you do with your frontend servers. Achieving horizontal scalability at all levels requires massive re-architecture of the system while being completely transparent to the end user. This is what a part of our team has focused on in the last few months, resulting in high uptime and availability.

It was becoming difficult for the master (and only) MySQL database to handle the heavy load. We thought we will delay any scalability at this level till the single database could handle the load. We would work on other high priority tasks instead. But that wasn't to be, and we experienced some down time. After that we did a rearchitecture of our application, sharded the database, wrote database routers and wrappers on top of django ORM, put HAProxy load balancer infront of the MySQL databases, and refactored our codebase to optimize it significantly.

The image below shows a part of the architecture we have at HackerEarth. Many other components have been omitted for simplicity.

Database slaves and router

The idea was to create read replicas and route the write queries to the master database and read queries to slave (read replica) databases. But that was not simple either. We couldn't and wouldn't want to route all the read queries to slaves. There were some read queries which couldn't afford stale data, which comes as a part of database replication. Though stale data might be the order of just a few seconds, these small number of read queries couldn't even afford that. The first database router was simple:

class MasterSlaveRouter(object):

    """

    Represents the router for database lookup.

    """

    def __init__(self):

        if settings.LOCAL:

            self._SLAVES = []

        else:

            self._SLAVES = SLAVES



    def db_for_read(self, model, **hints):

        """

        Reads go to default for now.

        """

        return 'default'



    def db_for_write(self, model, **hints):

        """

        Writes always go to default.

        """

        return 'default'



    def allow_relation(self, obj1, obj2, **hints):

        """

        Relations between objects are allowed if both objects are

        in the default/slave pool.

        """

        db_list = ('default',)

        for slave in zip(self._SLAVES):

            db_list += slave



        if obj1._state.db in db_list and obj2._state.db in db_list:

            return True

        return None



    def allow_migrate(self, db, model):

        return True

All the write and read queries go to the master database, which you might think is weird here. Instead, we wrote getfromslave(), filterfromslave(), getobjector404fromslave(), getlistor404fromslave(), etc. as part of django ORM in our custom managers to read from slave. So whenever we know we can read from slaves, we call one of these functions. This was a sacrifice made for those small number of read queries which couldn't afford the stale data. Custom database manager to fetch data from slave:

# proxy_slave_X is the HAProxy endpoint, which does load balancing

# over all the databases.

SLAVES = ['proxy_slave_1', 'proxy_slave_2']



def get_slave():

    """

    Returns a slave randomly from the list.

    """

    if settings.LOCAL:

        db_list = []

    else:

        db_list = SLAVES



    return random.choice(db_list)



class BaseManager(models.Manager):

    # Wrappers to read from slave databases.

    def get_from_slave(self, *args, **kwargs):

        self._db = get_slave()

        return super(BaseManager, self).get_query_set().get(*args, **kwargs)



    def filter_from_slave(self, *args, **kwargs):

        self._db = get_slave()

        return super(BaseManager, self).get_query_set().filter(

                *args, **kwargs).exclude(Q(hidden=True) | Q(trashed=True))

HAProxy for load balancing

Now there could me many slaves at a time. One option was to update the database configuration in settings whenever we added/removed a slave. But that was very cumbersome and inefficient. A better way was to put a HAProxy load balancer in front of all the databases and let it detect which one is up or down and route the read queries according to that. This would mean never editing the database configuration in our codebase — just what we wanted. A snippet of /etc/haproxy/haproxy.cfg:

listen mysql *:3305

    mode tcp

    balance roundrobin

    option mysql-check user haproxyuser

    option log-health-checks

    server db00 db00.xxxxx.yyyyyyyyyy:3306 check port 3306 inter 1000

    server db01 db00.xxxxx.yyyyyyyyyy:3306 check port 3306 inter 1000

    server db02 db00.xxxxx.yyyyyyyyyy:3306 check port 3306 inter 1000

The configuration for the slave in settings now looked like this:

DATABASES = {

    'default': {

        'ENGINE': 'django.db.backends.mysql',

        'NAME': 'db_name',

        'USER': 'username',

        'PASSWORD': 'password',

        'HOST': 'db00.xxxxx.yyyyyyyyyy',

        'PORT': '3306',

    },

    'proxy_slave_1': {

        'ENGINE': 'django.db.backends.mysql',

        'NAME': 'db_name',

        'USER': 'username',

        'PASSWORD': 'password',

        'HOST': '127.0.0.1',

        'PORT': '3305',

    },

    'analytics': {

        'ENGINE': 'django.db.backends.mysql',

        'NAME': 'db_name',

        'USER': 'username',

        'PASSWORD': 'password',

        'HOST': 'db-analytics.xxxxx.yyyyyyyyyy',

        'PORT': '3306',

    },

}

But there is a caveat here too. If you spin off a new server with the HAproxy configuration containing some endpoints which don't exist, HAproxy will throw an error and it won't start, making the slave useless. It turns out there is no easy solution to this, and haproxy.cfg should contain existing server endpoints while initializing. The solution then was to let the webserver update its HAproxy configuration from a central location whenever it starts. We wrote a simple script in fabric to do this. Besides, the webserver already used to update its binary when the spin off is from an old image.

Database sharding

Next, we sharded the database. We created another database — analytics. It stores all the computed data, and it forms a major part of read queries. All the queries to the analytics database are routed using the following router:

class AnalyticsRouter(object):

    """

    Represents the router for analytics database lookup.

    """

    def __init__(self):

        if settings.LOCAL:

            self._SLAVES = []

            self._db = 'default'

        else:

            self._SLAVES = []

            self._db = 'analytics'



    def db_for_read(self, model, **hints):

        """

        All reads go to analytics for now.

        """

        if model._meta.app_label == 'analytics':

            return self._db

        else:

            return None



    def db_for_write(self, model, **hints):

        """

        Writes always go to analytics.

        """

        if model._meta.app_label == 'analytics':

            return self._db

        else:

            return None



    def allow_relation(self, obj1, obj2, **hints):

        """

        Relations between objects are allowed if both objects are

        in the default/slave pool.

        """



        if obj1._meta.app_label == 'analytics' or \

                obj2._meta.app_label == 'analytics': 

            return True

        else:

            return None



    def allow_migrate(self, db, model):

        if db == self._db:

            return model._meta.app_label == 'analytics'

        elif model._meta.app_label == 'analytics':

            return False

        else:

            return None

To enable the two routers, we need to add them in our global settings:

DATABASE_ROUTERS = ['core.routers.AnalyticsRouter', 'core.routers.MasterSlaveRouter']

Here the order of routers is important. All the queries for analytics are routed to the analytics database and all the other queries are routed to the master database or their slaves according the nature of queries. For now, we have not put slaves for analytics database but as the usage grows that will be fairly straightforward to do. At the end, we had an architecture where we could spin off new read replicas and route the queries fairly simply and had a high performance load-balancer in front of the databases. All this has resulted in a much higher uptime and stability in our application, and we could focus more on what we love to do — building products for programmers. We already had an automated deployment system in place, which made the experimentation easier and enabled us to test everything thoroughly. The refactoring and optimization that we did in codebase and architecture also reduced the server count by more than two times. This has been a huge win for us, and we are now focusing on rolling out exciting products in the next few weeks. Stay tuned!

I would love to know how others have solved similar problems. Do give suggestions and point out potential quirks.

P.S. You might be interested in The HackerEarth Data Challenge that we are running.

Follow me @vivekprakash. Write to me at vivek@hackerearth.com.

This post was originally written for the HackerEarth Engineering blog by Vivek Prakash.

Tech Tutorials

Continuous Deployment System

This is one of the coolest and most important things we recently built at HackerEarth.

What's so cool about it? Just have a little patience, you will soon find out. But make sure you read till the end :)

I hope to provide valuable insights into the implementation of a Continuous Deployment System(CDS).

At HackerEarth, we iterate over our product quickly and roll out new features as soon as they are production ready. In the last two weeks, we deployed 100+ commits in production, and a major release comprising over 150+ commits is scheduled for launch within a few days. Those commits consist of changes to backend app, website, static files, database, and so on.

We have over a dozen different types of servers running, for example, webserver, code-checker server, log server, wiki server, realtime server, NoSQL server, etc. All of them are running on multiple EC2 instances at any point in time. Our codebase is still tightly integrated as one single project with many different components required for each server. When there are changes to the codebase, you need to update all the related dedicated servers and components when deploying in production. Doing that manually would have just driven us crazy and would have been a total waste of time!

Look at the table of commits deployed on a single day.

And with such speed, we needed an automated deployment system along with automated testing. Our implementation of CDS helped the team roll out features in production with just a single command: git push origin master. Also, another reason to use CDS is that we are trying to automate everything, and I see us going in right direction.

CDS Model

The process begins with the developer pushing a bunch of commits from his master branch to a remote repository, which in our case is set up on Bitbucket. We have set up a post hook on Bitbucket, so as soon as Bitbucket receives commits from the developer, it generates a payload(containing information about commits) and sends it to the toolchain server.

The toolchain server backend receives the payload and filters commits based on the branch and neglects any commit that is not from the master branch or of the type merge commit.

    def filter_commits(branch=settings.MASTER_BRANCH, all_commits=[]):

        """

        Filter commits by branch

        """

        commits = []



        # Reverse commits list so that we have branch info in first commit.

        all_commits.reverse()



        for commit in all_commits:

            if commit['branch'] is None:

                parents = commit['parents']

                # Ignore merge commits for now

                if parents.__len__() > 1:

                    # It's a merge commit and

                    # We don't know what to do yet!

                    continue



                # Check if we just stored the child commit.

                for lcommit in commits:

                    if commit['node'] in lcommit['parents']:

                        commit['branch'] = branch

                        commits.append(commit)

                        break

            elif commit['branch'] == branch:

                commits.append(commit)



        # Restore commits order

        commits.reverse()

        return commits

Filtered commits are then grouped intelligently using a file dependency algorithm.

    def group_commits(commits):

        """

        Creates groups of commits based on file dependency algorithm

        """



        # List of groups

        # Each group is a list of commits

        # In list, commits will be in the order they arrived

        groups_of_commits = []



        # Visited commits

        visited = {}



        # Store order of commits in which they arrived

        # Will be used later to sort commits inside each group

        for i, commit in enumerate(commits):

            commit['index'] = i



        # Loop over commits

        for commit in commits:

            queue = deque()



            # This may be one of the group in groups_of commits,

            # if not empty in the end

            commits_group = []



            commit_visited = visited.get(commit['raw_node'], None)

            if not commit_visited:

                queue.append(commit)



            while len(queue):

                c = queue.popleft()

                visited[c['raw_node']] = True

                commits_group.append(c)

                dependent_commits = get_dependent_commits_of(c, commits)



                for dep_commit in dependent_commits:

                    commit_visited = visited.get(dep_commit['raw_node'], None)

                    if not commit_visited:

                        queue.append(dep_commit)

            

            if len(commits_group)>0:

                # Remove duplicates

                nodes = []

                commits_group_new = []

                for commit in commits_group:

                    if commit['node'] not in nodes:

                        nodes.append(commit['node'])

                        commits_group_new.append(commit)

                commits_group = commits_group_new



                # Sort list using index key set earlier

                commits_group_sorted = sorted(commits_group, key= lambda

                        k: k['index'])

                groups_of_commits.append(commits_group_sorted)



        return groups_of_commits

The top commit of each group is sent for testing to the integration test server via rabbitmq. First, I wrote code which sent each commit for testing, but it was too slow. So Vivek suggested that I group commits from payload and run a test on the top commit of each group, which drastically reduced number of times tests are run. Integration tests are run on the integration test server. There is a separate branch called test on which tests are run. Commits are cherry-picked from master onto test branch. Integration test server is a simulated setup to replicate production behavior. If tests are passed, then commits are put in release queue from where they are released in production. Otherwise, the test branch is rolled back to a previous stable commit and clean-up actions are performed, including notifying the developer whose commits failed the tests.

Git Branch Model

We have been using three branches — master, test, and release. In the Master, the developer pushes the code. This branch can be unstable. Test branch is for the integration test server and release branch is for the production server. Release and test branches move parallel, and they are always stable. As we write more tests, the uncertainty of a bad commit being deployed to production will reduce exponentially.

Django Models

Each commit(or revision) is stored in the database. This data is helpful in many circumstances like finding previously failed commits, relating commits to each other using file dependency algorithm, monitoring deployment, etc. Following are the Django models used:* Revision- commithash, commitauthor, etc. * Revision Status- revisionid, testpassed, deployedonproduction, etc. * Revision Files- revisionid, filepath * Revision Dependencies. When the top commit of each group is passed to the integration test server, we first find its dependencies, that is, previously failed commits using the file dependency algorithm, and save it in the Revision Dependencies model so that we can directly query from the database the next time.

def get_dependencies(revision_obj):

    dependencies = set()

    visited = {}



    queue = deque()

    filter_id = revision_obj.id

    queue.append(revision_obj)



    while len(queue):

        rev = queue.popleft()

        visited[rev.id] = True

        dependencies.add(rev)

        dependent_revs = get_all_dependent_revs(rev, filter_id)



        for rev in dependent_revs:

            r_visited = visited.get(rev.id, None)

            if not r_visited:

                queue.append(rev)

    #remove revision from it's own dependecies set.

    #makes sense, right?

    dependencies.remove(revision_obj)

    dependencies = list(dependencies)

    dependencies = sorted(dependencies, key=attrgetter('id'))

    return dependencies 



def get_all_dependent_revs(rev, filter_id):

    deps = rev.health_dependency.all()

    if len(deps)>0:

        return deps



    files_in_rev = rev.files.all()

    files_in_rev = [f.filepath for f in files_in_rev]



    reqd_revisions = Revision.objects.filter(files__filepath__in=files_in_rev, id__lt=filter_id, status__health_status=False) 

    return reqd_revisions

As we saw earlier in the Overview section, these commits are then cherry-picked onto the test branch from the master branch, and the process continues.

Deploying to Production

Commits that passed integration tests are now ready to be deployed. There are a few things to consider when deploying code to production, such as restarting webserver, deploying static files, running database migrations, etc. The toolchain code intelligently decides which servers to restart, whether to collect static files or run database migrations, and which servers to deploy on based on what changes were done in the commits. You might have noticed we do all this on the basis of types and categories of files changed/modified/deleted in the commits to be released. You might also have noted that we control deployment to production and test servers from the toolchain server (that's the one which receives payload from bitbucket). We use fabric to achieve this. A great tool indeed for executing remote administrative tasks!

from fabric.api import run, env, task, execute, parallel, sudo

@task

def deploy_prod(config, **kwargs):

    """

    Deploy code on production servers.

    """



    revision = kwargs['revision']

    commits_to_release = kwargs['commits_to_release']



    revisions = []

    for commit in commits_to_release:

        revisions.append(Revision.objects.get(raw_node=commit))



    result = init_deploy_static(revision, revisions=revisions, config=config,

                                commits_to_release=commits_to_release)

    is_restart_required = toolchain.deploy_utils.is_restart_required(revisions)

    if result is True:

        init_deploy_default(config=config, restart=is_restart_required)

All these processes take about 2 minutes for deployment on all machines for a group of commits or single push. Our life is a lot easier; we don't worry anymore about pushing our code, and we can see our feature or bug fix or anything else live in production in just a few minutes. Undoubtedly, this will also help us release new features without wasting much time. Now deploying is as simple as writing code and testing on a local machine. We also deployed the hundredth commit to production a few days ago using automated deployment, which stands testimony to the robustness of this system. P.S. I am an undergraduate student at IIT-Roorkee. You can find me @LalitKhattar.

This post was originally written for the HackerEarth Engineering blog by Lalit Khattar, Summer Intern 2013 @HackerEarth