Dive Deep into Designing a Messaging System

Jackie Dinh
9 min read · Mar 21, 2022

Microservices are a popular architectural style for highly scalable systems that embrace constant change and evolution, which is in high demand from a business perspective nowadays. Despite the good ideas behind microservices, I have seen many failures where misunderstanding of key concepts leads to anti-patterns. In this article, I’ll try to touch on important principles of microservices while designing a messaging system.

I have seen a lot of successful monolithic designs in the past. However, their release cycles are usually longer than those of microservices. If your goals align with a monolithic design, then it is totally fine.

Let’s take a look at key characteristics of microservices:

  • loose coupling: the idea of microservices is to split a system into smaller services that can be built and deployed independently.
  • high cohesion: each service stands alone, running on its own data with few or no dependencies on other services.

In my opinion, designing microservices is not hard, as many of the techniques were well developed before microservices were promoted. Here are some best practices you can follow to make sure you have a correct microservice design:

  • Domain-Driven Design: it is used to identify bounded contexts, which are the foundation for microservices. A rule of thumb says a service implementing a bounded context should be small enough to be implemented within two weeks, but this is just a hint.
  • Consistent async API: an asynchronous programming model not only enables asynchronous operations but also allows services to run in a highly decoupled fashion. The event-based approach is a great fit for microservices. By ‘consistent’, I mean that having too many types of API in microservices is a bad idea; a couple or a few of them, such as custom RPC, REST, and gRPC, are perfectly good.
  • No shared data: database sharing leads to high coupling between services, also referred to as data coupling. Try to avoid it. In practice, we can go a step further and partition/shard the database used by a microservice, which boosts service performance significantly.

I would say that these design practices have been used in system design for a long time, but only microservices highlight their importance and usefulness to such a great extent.

For the sake of simplicity, assume that we are going to design a messaging system with the following basic requirements:

  • allowing users to send messages to each other and to send messages in a chat group;
  • each user has a list of friends and groups and is able to add/remove/block friends;
  • each user has a profile which contains their name, phone, avatar, etc.

Let’s get started with bounded contexts as we use them as the basis for building services in the messaging system.

Identify Bounded Contexts

It is considered a good approach to start with a monolithic design at the beginning, assuming that there are many unknowns in the system. After a few development iterations, you are able to gather important insights into the messaging system, with clearly defined bounded contexts (indeed, my team used this approach). I think this step is crucial to shaping your microservice design, and if it is done properly, you get a design that could last for years.

Let’s check the bounded contexts we found:

  • UserProfile: this context is about user management in the messaging system. Looking closely, we could further identify unrelated subcontexts like personal information, preferences, privacy and security settings, etc. For simplicity, we just use UserProfile for now.
  • ContactList: this was initially included in UserProfile. Since ContactList describes many-to-many relationships between users, I found it easier to handle its complexity in isolation.
  • OnlineStatus: a user's online status is important information for the messaging system to implement real-time message routing. If a user is online, they can receive messages in real time; if a user is offline, a message is pushed to a queue and stored as an offline message for them. This is why we have this third bounded context, OnlineStatus, where we store all information about a user's online status: IP address, client app version, and the devices the user is logged in from (a user may log in from multiple devices).
  • GroupChat: this context handles all group management (such as group permissions, admins, …) and message routing.

By implementing these bounded contexts we turn the monolith into microservices. Notice that microservices need a service discovery mechanism to send events to the correct service instance. A message broker may be a better choice here (we’ll talk about it shortly).

Microservice with Service Discovery

Event-based approach

We particularly focus on an event-based approach, which gives us loose coupling between services. To enable it, first we need to strictly define the event message format that all services agree on for sending/receiving events. The event transport protocol could be REST, gRPC, Thrift, or even a custom protocol for internal communications, and REST and WebSocket for external communications. Secondly, we make sure that implementations of the transport are available in the stacks you are going to use in your system.

In the messaging system, we choose REST and WebSocket for external third parties and a custom protocol based on Protobuf for internal communications. We choose Protobuf because it is space efficient and handling Protobuf objects is quite easy on most platforms.

We use Kafka as a message broker to implement this design. RabbitMQ is also a good alternative, though we found that sharding based on RabbitMQ message routing is not easy and requires installing the consistent-hash exchange plugin.

It is worth mentioning that using the event-based approach with Kafka eliminates the need for service discovery. A service only needs to know to which Kafka topic it should push a message.

It seems that we are locked into the Kafka stack, but if you instead implement distributed consistent hashing for load balancing, you are probably locked into other stacks such as ZooKeeper for service discovery. Moreover, you are at risk of having to implement distributed consistent hashing at each service instance. So, unless you have a good reason to do it, consider your choice carefully here.
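To illustrate what each service instance would have to maintain if we went the consistent-hashing route, here is a minimal, pure-Python sketch of a consistent hash ring (node names and virtual-node count are made up for the example; a production ring also needs membership updates, which is exactly where ZooKeeper comes in):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring mapping keys (e.g. user ids) to nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring many times (virtual nodes)
        # so keys spread evenly across nodes.
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["status-1", "status-2", "status-3"])
owner = ring.node_for("user:42")  # deterministic owner for this user id
```

The mapping is stable for a fixed node set, but every service instance must keep this ring in sync with cluster membership, which is the coordination burden the Kafka approach avoids.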

Event Format and Service API

Since we support frontends on different platforms (iOS, Android, and web), it is natural to use JSON messages for communication with the system. The JSON format is flexible, but it consumes more bandwidth and processing than Protobuf. However, Protobuf parsing was not stable in JavaScript (at the time we evaluated it, several years ago), so we support both Protobuf and JSON for frontends and use Protobuf for internal communications. We define a Protobuf-based transport protocol for the messaging system. The event message format is defined as follows:

event_message := <header> <payload>
header := <api_id> <api_version> <message_id> <reply_to>
payload := <protobuf-based-data-of-message>
where:
- api_id: id of the service API interface
- api_version: version number of the service API interface
- message_id: identifies the message format of the content
- reply_to: the endpoint that receives the response to this request
- payload: protobuf-based data

Using Protobuf allows us to extend events easily while keeping backward compatibility. We also have api_version to determine the version of the event in the payload; semantic versioning is a good choice here. We can also support multiple API versions at the same time.
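A minimal sketch of encoding and decoding this envelope in Python. The fixed field widths and the length-prefixed reply_to are my own assumptions for illustration, not part of the article’s actual wire format:

```python
import struct

# Assumed header layout: api_id (uint16), api_version (uint16),
# message_id (uint32), then a length-prefixed UTF-8 reply_to topic name.
HEADER_FMT = "!HHIH"

def pack_event(api_id, api_version, message_id, reply_to, payload):
    """Build <header><payload>; payload is already protobuf-serialized bytes."""
    reply = reply_to.encode("utf-8")
    header = struct.pack(HEADER_FMT, api_id, api_version, message_id, len(reply))
    return header + reply + payload

def unpack_event(data):
    """Split an event back into its header fields and the raw payload."""
    api_id, api_version, message_id, rlen = struct.unpack_from(HEADER_FMT, data)
    offset = struct.calcsize(HEADER_FMT)
    reply_to = data[offset:offset + rlen].decode("utf-8")
    payload = data[offset + rlen:]
    return api_id, api_version, message_id, reply_to, payload

msg = pack_event(7, 1, 1001, "gateway.responses", b"\x0a\x03bob")
decoded = unpack_event(msg)
```

Because the payload is opaque bytes to the envelope, services can route on the header alone without understanding the Protobuf content.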

Data Partitioning

Even if we adopt the no-shared-data approach, we still have some data coupling issues that need to be addressed. In microservices, if an API involves updates across two or more services, then we have data coupling between the involved services. Here there is a tradeoff between performance and data consistency: a strict data consistency policy may hurt service performance, while eventual consistency may have data freshness issues.

We also consider database sharding to further boost service performance. We can split a database table into smaller tables, let’s say 100 or even 1000. This enables high parallelism and reduces read/write lock contention. By owning a smaller table, a service instance can handle data without worrying about data races. However, there is another type of data coupling for some services: for example, in ContactList, when user A adds user B to his contact list, we need to make sure that the contact lists of both A and B are updated. We choose strict consistency for this particular case; however, we can switch to eventual consistency if service performance degradation becomes noticeable.
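The shard-selection idea can be sketched in a few lines (the shard count, table naming, and modulo routing are illustrative assumptions, not the system’s actual scheme):

```python
NUM_SHARDS = 100  # assumed; the article suggests 100 or even 1000 sub-tables

def shard_table(base_table: str, user_id: int) -> str:
    """Route a user's rows to one of NUM_SHARDS smaller tables."""
    return f"{base_table}_{user_id % NUM_SHARDS}"

# When user A (123) adds user B (456), two different shards may be touched,
# which is exactly why this case needs a cross-shard consistency decision.
touched = {shard_table("contact_list", 123), shard_table("contact_list", 456)}
```

Any deterministic function of the user id works here; modulo is the simplest, though it makes changing NUM_SHARDS later painful, which is one argument for consistent hashing at the storage layer.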

Database partitioning really matters if you want to build a high-performance system. Without it, most of the time you will end up synchronizing data or doing complex coordination across service instances, which hurts performance in one way or another.

Another issue is how to distribute user online status across instances of OnlineStatus. In order to route messages, we need to know which instance of OnlineStatus to send a message to, based on the receiver id. One solution is to use a distributed consistent hash. However, we have a better solution: Kafka. We use Kafka partitions to split the user id space into several parts. All instances of OnlineStatus listening on the same Kafka topic receive data from the partitions Kafka assigns to them. Kafka is also able to dynamically reassign partitions to instances if any instance is added to or removed from the topic. In other words, we can scale easily.
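A toy model of the two mechanisms at work here. The partition count and the round-robin assignment are simplifications: Kafka’s default partitioner hashes the message key (with murmur2) modulo the partition count, and its group coordinator handles the actual rebalancing:

```python
NUM_PARTITIONS = 12  # assumed partition count for the OnlineStatus topic

def partition_for(user_id: int) -> int:
    """Pick the topic partition for a receiver id.
    Plain modulo stands in for Kafka's key-hash partitioner."""
    return user_id % NUM_PARTITIONS

def assign(partitions, instances):
    """Spread partitions over consumer instances, roughly what Kafka's
    group rebalancing does when instances join or leave the group."""
    mapping = {inst: [] for inst in instances}
    for p in partitions:
        mapping[instances[p % len(instances)]].append(p)
    return mapping

before = assign(range(NUM_PARTITIONS), ["status-1", "status-2", "status-3"])
after = assign(range(NUM_PARTITIONS), ["status-1", "status-2"])  # one removed
```

The key property is that every partition always has exactly one owner, so a given receiver id is always handled by exactly one OnlineStatus instance, with no ring to maintain in application code.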

Event Routing

Taking advantage of the Kafka stack, we can design a self-routing event mechanism. The idea is that a request event can travel through many services before there is enough info to generate a response event. Once the response event is created, the info from the request event header can be used to forward the response directly to the requesting service. In the event header, we use the ‘reply_to’ field to define the Kafka topic that the response will be sent to. This can reduce the latency of the system significantly.
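The mechanism in miniature, with a dictionary standing in for the broker and plain dicts for events (topic names and fields are invented for the example):

```python
from collections import defaultdict

# Toy in-memory "broker": topic name -> list of delivered events.
topics = defaultdict(list)

def publish(topic, event):
    topics[topic].append(event)

def handle_request(event):
    """The last service in the chain sends the response straight to the
    topic named in reply_to, skipping the intermediate hops."""
    response = {"in_reply_to": event["message_id"], "body": "ok"}
    publish(event["reply_to"], response)

request = {
    "message_id": 1001,
    "reply_to": "gateway.responses",  # set once by the requesting service
    "body": "who-is-online",
}
handle_request(request)
```

Because reply_to travels with the event header, intermediate services can stay oblivious to who originated the request.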

Microservice with Kafka

Database Design

Choosing the right database is important. The messaging system has massive database writes to store messages. NoSQL is a great choice as it scales out easily.

In the beginning, my team used MySQL; it worked well for many years but eventually reached its limit. There were many available options; in the end we chose TiDB as it is MySQL-compatible and scales out like NoSQL.

We notice that data falls into two categories: persistent and volatile. Persistent data includes user profiles, contact lists, etc., while volatile data changes over the course of time, such as user online status, per-user read status on messages, etc. Storing persistent data in the database is perfectly sound, but updating volatile data for every change would kill database performance. Volatile data is better kept in a local cache of a service or in Redis/Memcached.
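A minimal sketch of what such a local cache for volatile data might look like (in a real deployment, Redis with an EXPIRE per key plays this role; the TTL and key format here are invented):

```python
import time

class TTLCache:
    """Tiny expiring in-process cache for volatile data like online status."""

    def __init__(self, ttl_seconds: float):
        self._ttl = ttl_seconds
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self._data[key] = (value, time.monotonic() + self._ttl)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None or time.monotonic() >= entry[1]:
            # Expired or missing: drop it and report absence.
            self._data.pop(key, None)
            return default
        return entry[0]

status = TTLCache(ttl_seconds=30.0)
status.set("user:42", {"online": True, "device": "ios"})
```

Expiry doubles as a crude liveness check: if a client stops refreshing its status, the entry ages out and the user is treated as offline without any database write.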

Designing the correct database schema is crucial. The traditional approach often defines the following tables: user, message, and message_recipient.

CREATE TABLE user (
id bigint,
name varchar(128)
);
CREATE TABLE message (
id bigint,
group_id bigint,
content longblob
);
CREATE TABLE message_recipient (
message_id bigint,
recipient_id bigint
);

In this way, every message requires one insert into the message table and several inserts into the message_recipient table, which puts a lot of pressure on the database. Instead of defining recipients in a separate table, we add the recipients to the message content, which is Protobuf-serialized data. This not only removes the message_recipient table but also handles a dynamic number of recipients nicely.
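The idea can be sketched as follows, with JSON standing in for the Protobuf-serialized blob the real system stores in the content column:

```python
import json

def make_message(message_id, group_id, recipients, text):
    """Pack the recipient list into the message content itself, so storing
    a message is a single row insert regardless of recipient count."""
    content = json.dumps({"recipients": recipients, "text": text}).encode("utf-8")
    return {"id": message_id, "group_id": group_id, "content": content}

# One insert into `message`, whether there are 3 recipients or 300.
row = make_message(1, 7, [101, 102, 103], "hello")
recipients = json.loads(row["content"])["recipients"]
```

The tradeoff is that SQL can no longer answer "which messages mention recipient X" directly; that query now requires deserializing content, so it only works if message fan-out is always driven by the application, not the database.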

In summary, we have discussed a few important aspects of designing high-performance messaging systems using a microservice approach: bounded contexts, the event-based approach, and no shared data. This is just the core of the messaging system; it can easily be extended with more features like classroom and notification support (Firebase Cloud Messaging and Apple Push Notifications), as shown in the diagram below.

Messaging System with ClassRoom and Notification Support

Another interesting issue is how to deploy this system in multiple data centers. I leave it to curious readers to figure it out.

Thanks for reading!
