Qi4j and the NoSQL movement

The following entry was originally posted on Rickard’s blog. Jayway is a founding company of the Qi4j project.

The second presentation from JavaZone 2009 that I want to comment on is “På tide å kaste ut relasjonsdatabasen?” (Is it time to throw out the relational database?) by Trond Arve Wasskog, which continues the current trend of looking at alternatives to relational databases for persistence.

For myself, I have for some time argued that most people seem to be using relational databases for four separate things: storing objects, querying them, reporting, and backups. In my view they are only really good at the reporting part and literally suck at the rest. The object-relational impedance mismatch is a well-known issue, that DDD values are hard to implement using OR mappers also seems to be common knowledge, and backups are not exactly efficient or easy to make either.

In Qi4j we have explicit SPI support for storing and querying objects, separated from each other. Typically the EntityStore SPI will be implemented by a key-value implementation, the benefits of which were described in the previous post. The EntityFinder SPI is currently only implemented by an RDF repository extension based on Sesame. RDF, and the SPARQL query language, allow for queries that are much more object-oriented in nature than SQL, and while it takes a while to get used to them, most of that is hidden behind the Query API in Qi4j. Reporting is not directly supported in Qi4j, but in my own project StreamFlow I just implemented an application service that consumes all events from the domain model, as we are using CQS and Event Sourcing, and from that generates denormalized report data in a MySQL database, which our customer can use to make the statistics and reports they need. All without ever touching or caring about the key-value store that we use for our domain model.

Isn’t that great? Isn’t that obviously good? Isn’t that the obvious way to deal with all these headaches that developers have been complaining about for, oh I don’t know, the last 10 years or so (at least)? Sheesh. Sometimes I feel like software developers today are much like Pavlov’s dogs. They get continuously electrocuted with bad ways of doing things, and at first they react, but after some time they just take it as normal, and will not figure out ways to stop getting electrocuted. Amazing, from a psychological point of view. But I digress.

Here are some specific comments on the points made in the presentation.

Entity modeling

At 3:48 Trond shows an ER diagram of a small system. This shows a typical OO diagram with properties and relationships between entities. For me, now that I have become accustomed to thinking in terms of composites, where I want to separate responsibilities within entities into bounded contexts, the diagram has a couple of contexts that to me are overlapping. For example, the OrderItem has OrderItemId, OrderId, ProductId and Qty, which would be logical to have for the initial order. Then, it also has MinDeliveryQty and PendingQty, which are more related to executing the order delivery. These two concepts, the order itself and its delivery, are mixed into one diagram. To me this is a bad thing.

In Qi4j, instead of thinking of entities as monolithic objects, you always start by decomposing them into the various use cases. For the given ER diagram you would think of “placing the order” and “delivering the order” as two scenarios. These would then each result in interfaces that the Entity needs to implement, separated from each other. This allows many scenarios on the same Entity to coexist without disturbing each other, and is one of the key benefits of using a composite model. All those bounded contexts in simple cases just translate into mixin interfaces that the entities implement. On disk they are one, meaning, if you look at the state on disk it’s going to be similar to what is shown in the ER diagram, but from a domain model point of view they will be separate, and can evolve separately.
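As a rough illustration of that decomposition, here is a plain-Java sketch. It deliberately does not use the Qi4j API; the names OrderedItem, DeliverableItem and OrderItemEntity, and the delivery rule in deliver(), are made up for this example. The field names only loosely follow the ER diagram described above.

```java
// Plain-Java sketch, not the Qi4j API. All type names are hypothetical.

// Context 1: "placing the order" only sees what was ordered.
interface OrderedItem {
    String productId();
    int quantity();
}

// Context 2: "delivering the order" only sees delivery bookkeeping.
interface DeliverableItem {
    int minDeliveryQuantity();
    int pendingQuantity();
    void deliver(int delivered);
}

// One entity with one set of state, exposed through two separate views.
final class OrderItemEntity implements OrderedItem, DeliverableItem {
    private final String productId;
    private final int quantity;
    private final int minDeliveryQuantity;
    private int pendingQuantity;

    OrderItemEntity(String productId, int quantity, int minDeliveryQuantity) {
        this.productId = productId;
        this.quantity = quantity;
        this.minDeliveryQuantity = minDeliveryQuantity;
        this.pendingQuantity = quantity; // nothing delivered yet
    }

    public String productId() { return productId; }
    public int quantity() { return quantity; }
    public int minDeliveryQuantity() { return minDeliveryQuantity; }
    public int pendingQuantity() { return pendingQuantity; }

    // Assumed rule for this sketch: a delivery must reach the minimum
    // quantity unless it completes the order, and may never exceed
    // what is still pending.
    public void deliver(int delivered) {
        if (delivered <= 0 || delivered > pendingQuantity) {
            throw new IllegalArgumentException("invalid delivery quantity");
        }
        if (delivered < minDeliveryQuantity && delivered != pendingQuantity) {
            throw new IllegalArgumentException("below minimum delivery quantity");
        }
        pendingQuantity -= delivered;
    }
}
```

Code that places orders would depend only on OrderedItem, and code that ships them only on DeliverableItem, so either interface can change without touching the other scenario.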

Relations are defined at design time

At 7:43 Trond discusses the point that when you design your entities, that design is immediately transferred down into the database schema, which then becomes a point of inertia against change. Evolving the domain model becomes tricky, as migrating the data has to be done at the same time.

When you are using a key-value store there is no fixed schema defined for your objects, so each object could really be different. There is no strict requirement to migrate “all the universe” at once; this can instead be done over time, if your domain allows such things. Also, with regard to Qi4j, if you are using a key-value store then the entity design is ONLY done in the Qi4j Entity composites. Once you are done and start the system, the data, with associations and properties, will be automatically stored without the need to define any mappings or the like. It will “just work”.
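To make the "migrate over time" idea concrete, here is a minimal plain-Java sketch, assuming an in-memory map as a stand-in for a key-value store. It is not Qi4j code; the class name SchemalessStore, the "version" marker and the renamed field ("qty" to "quantity") are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch, not the Qi4j API. Names and the "version" field are hypothetical.
final class SchemalessStore {
    // Each entity is just a bag of named values; no shared table definition exists.
    private final Map<String, Map<String, String>> entities = new HashMap<>();

    void put(String id, Map<String, String> state) {
        entities.put(id, new HashMap<>(state));
    }

    // Old-format entities are upgraded only when they are actually loaded,
    // so entities in both formats can coexist in the store.
    Map<String, String> load(String id) {
        Map<String, String> state = entities.get(id);
        if (state == null) {
            return null;
        }
        if (!"2".equals(state.get("version"))) {
            // Assumed model change for this sketch: "qty" was renamed to "quantity".
            String qty = state.remove("qty");
            if (qty != null) {
                state.put("quantity", qty);
            }
            state.put("version", "2");
        }
        return new HashMap<>(state); // hand out a copy, keep the stored state private
    }
}
```

Entities that are never loaded simply stay in the old format, which is only acceptable if, as noted above, the domain allows it.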

This drastically reduces the time needed to design an entity model, as all the overhead for dealing with RDBMS schemas and mappings just goes away. This feature also avoids the next problem Trond discusses at 08:47, recursive structures. In Qi4j there is no problem with allowing Entities to have recursive references, as the EntityStore SPI will handle all of that for you. If you need to store complex values, then they will be serialized to JSON strings by default, and stored in one field, so that is also handled automatically, and in a way that you can manually look at the data and do data migration if you want to. This also handles the next problem that Trond talks about at 09:18, which is that domain model information tends to get spread out all over the application, and especially into the database. With Qi4j, as the entity definition is entirely done in the model, and the EntityStore will use that to automatically store the data, the problem goes away.

Handling change

At 10:50 Trond discusses the problem of change with relational databases, which is related to the previous topics already discussed. One issue here is purely social: when a developer makes a change to the model the DBA has to implement it in the database. Why is this necessary at all? I think one reason is that the DBA is the person who keeps track of all the applications that access the same database, and ensures that they all continue to work after changes have been made. This is a side-effect of having the four responsibilities mentioned above dumped into one technology. If the domain model instead resided in its own application store, which is key-value based, and reporting and integration are instead done based on events from the domain model, then the application is freed from these restrictions, and it becomes much easier to make changes without affecting other applications that use the same data.

Key-value stores

At 16:55 Trond talks about one of the main alternatives to using relational databases, that is, key-value stores. There’s a whole bunch of them available now, the most famous of which would have to be BerkeleyDB. For myself, I have been using JDBM for some time, mostly out of habit. But the main point is the architectural possibilities you get by using something like this. For some reason Trond is focusing on network-based key-value stores, such as SimpleDB and the Google database. They both embrace the Entity-Attribute-Value (EAV) model, which is great, but to me they also miss one of the points of a key-value store: the local access which minimizes latency, which in turn makes it soooo much easier to deal with all the issues outlined in the previous post. If you have a key-value store in a data-center somewhere else, the first thing you would have to do is add a local cache to get decent performance, and then the question is what’s the point of it all. I would prefer to have local databases using JDBM or BerkeleyDB, which have awesome performance, and then use replication to get all nodes to hold the same data. If it’s too much data to have at each node, then consider a networked store with a cache solution, but it wouldn’t be my first option, given all the benefits of local key-value stores mentioned above. Trond’s summary of available products at 35:00 includes more examples of local-yet-replicated datastores, so start from that and look at what each store provides. From the Qi4j perspective, it should be trivial to implement EntityStore implementations for all of them, since our SPI is based on the EAV model.

Schemas

At 41:0 Trond mentions the obvious, that these databases put the responsibility for schemas on the application developer. This is something that Qi4j brings to the table when it comes to dealing with key-value stores: by defining the Entities as composites in Qi4j, where all properties and associations are defined, the application developer does not have to deal with this. As long as you have defined the Entities, Qi4j will do the mapping from live Java object to the datastore for you. As this can be one of the main things keeping people from looking at key-value stores, I would encourage you to think about this when evaluating whether to use key-value stores, and also whether to use Qi4j. How many other frameworks give you access to key-value stores in a consistent and easy-to-use way?

Joins and queries

As Trond points out at 42:35 (and later at 43:50) there is no support for “joins” and “queries” in key-value-stores, and neither should there be. Queries should be done using technologies which are good at indexing and querying, such as an RDF store. In Qi4j queries are not done through the key-value-store so this “drawback” goes away entirely. By combining a key-value-store, which has blazing performance for loading and storing, with a query engine that is optimized for that, you are using each technology for what it does best, but accessed through an API that makes the distinction seamless for the application developer. This is yet another thing that Qi4j provides as a benefit for the application developer.
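The split between storing and querying can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions: KeyValueStore, PropertyIndex and EntityRepository are hypothetical names, not the Qi4j EntityStore or EntityFinder SPIs, and a simple in-memory "property=value" index stands in for a real query engine such as an RDF store.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Stand-in for the storage side: load and store by identity only.
final class KeyValueStore {
    private final Map<String, Map<String, String>> data = new HashMap<>();

    void put(String id, Map<String, String> state) { data.put(id, new HashMap<>(state)); }
    Map<String, String> get(String id) { return data.get(id); }
}

// Stand-in for the query side: answers "which identities match?", never returns state.
final class PropertyIndex {
    private final Map<String, Set<String>> index = new HashMap<>(); // "property=value" -> identities

    void index(String id, Map<String, String> state) {
        for (Map.Entry<String, String> e : state.entrySet()) {
            index.computeIfAbsent(e.getKey() + "=" + e.getValue(), k -> new TreeSet<>()).add(id);
        }
    }

    Set<String> find(String property, String value) {
        return index.getOrDefault(property + "=" + value, Collections.emptySet());
    }
}

// The facade the application sees; it hides which technology does what.
// Note: this sketch does not handle updates or removals of indexed entities.
final class EntityRepository {
    private final KeyValueStore store = new KeyValueStore();
    private final PropertyIndex finder = new PropertyIndex();

    void save(String id, Map<String, String> state) {
        store.put(id, state);    // fast write by key
        finder.index(id, state); // keep the query side up to date
    }

    List<Map<String, String>> findBy(String property, String value) {
        List<Map<String, String>> result = new ArrayList<>();
        for (String id : finder.find(property, value)) {
            result.add(store.get(id)); // the query yields identities, the store yields state
        }
        return result;
    }
}
```

The query side only ever returns identities; the state is always loaded from the key-value store, so each part does the one job it is good at.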

The possible exception to this would be if you use Neo4j as the persistence solution. Neo4j would be good both at storing the data and doing advanced queries based on relationships between entities. In that case Neo4j would both implement the EntityStore SPI and EntityFinder SPI in Qi4j.

Datatypes

When it comes to datatypes, the EntityStore SPI in Qi4j is such that if the underlying datastore has native support for a type, such as Long or Date, then that can be used. Otherwise those types can either be stringified or serialized (and then stringified using base64 encoding) so that any type can be stored. This shields the application developer from having to deal with any such limitations in the underlying store. Complex values that have some kind of internal structure are best implemented as ValueComposites, which can then be saved as JSON strings in the key-value store automatically.
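The three cases (native, stringified, serialized then base64-encoded) might look roughly like this in plain Java. This is a sketch under assumptions, not the actual EntityStore SPI: ValueEncoder and its methods are hypothetical, and the set of "native" types is simply passed in by the caller. It uses only standard JDK classes (java.io serialization and java.util.Base64).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, not part of Qi4j.
final class ValueEncoder {
    // nativeTypes: the types the (assumed) underlying store can hold as-is.
    static Object encode(Object value, Class<?>... nativeTypes) {
        // Case 1: strings and natively supported types pass through unchanged.
        if (value instanceof String || Arrays.asList(nativeTypes).contains(value.getClass())) {
            return value;
        }
        // Case 2: simple values are stringified.
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        // Case 3: anything else is serialized, then base64-encoded into a string.
        if (!(value instanceof Serializable)) {
            throw new IllegalArgumentException("cannot store " + value.getClass());
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverses case 3 only.
    static Object decodeSerialized(String text) {
        byte[] raw = Base64.getDecoder().decode(text);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For example, encode(42L, Long.class) returns the Long unchanged, while encode(42L) with no native types returns the string "42".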

Aggregation

At 44:45 Trond mentions that there is no support for aggregation in key-value-stores. Those kinds of needs typically relate to reporting, and reporting should not be done in a key-value-store. Instead your application should generate events, which can be used to create relational data, which then makes it possible to use aggregation or similar types of SQL functions easily. The point here is to use each technology for what it is good at, and DON’T use it for things it is NOT good at. Blindingly obvious, isn’t it? The main problem developers seem to have is that they are trying to use one tool for everything, rather than having many tools that are specialized. As patterns like CQS and Event Sourcing become more popular, and supported through frameworks, I think we will move more towards this style of architecture.
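As a small sketch of events feeding a report, assuming an in-memory list as a stand-in for a relational reporting table: the event type ItemOrdered and the class SalesReportProjection are invented for this example, and totalByProduct() plays the role that a SQL GROUP BY with SUM would play against a real database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical domain event: some quantity of a product was ordered.
final class ItemOrdered {
    final String productId;
    final int quantity;

    ItemOrdered(String productId, int quantity) {
        this.productId = productId;
        this.quantity = quantity;
    }
}

// Consumes domain events and maintains flat, report-friendly rows.
// The domain model's own store is never touched.
final class SalesReportProjection {
    // One flat row per event, as it might be inserted into a reporting table.
    private final List<String[]> rows = new ArrayList<>();

    void on(ItemOrdered event) {
        rows.add(new String[] { event.productId, Integer.toString(event.quantity) });
    }

    // Stand-in for: SELECT product_id, SUM(quantity) ... GROUP BY product_id
    Map<String, Integer> totalByProduct() {
        Map<String, Integer> totals = new TreeMap<>();
        for (String[] row : rows) {
            totals.merge(row[0], Integer.parseInt(row[1]), Integer::sum);
        }
        return totals;
    }
}
```

The aggregation lives entirely on the reporting side, so the key-value store never needs to support it.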

Some people say that we should go “polyglot”, i.e. use multiple languages, where each language is the best for the job. The same goes for persistence technologies: we should use multiple persistence technologies, where each is specific for what it is supposed to do. The main problem then becomes how to integrate them nicely, and I think the EventSourcing ideas, and everything that goes with them, are what we will be using as the solution.

Referential integrity

Trond also points out that there is no support for maintaining referential integrity in these stores. If your store is using events, then you can have consumers that consume those events (e.g. when an entity is removed) and traverse the database, either linearly or by using queries, to find reference problems, and then repair them by triggering new events.
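A minimal sketch of such a consumer in plain Java, assuming references are available as a map from owning entity to the identities it points at. The names ReferenceCleared and IntegrityConsumer are hypothetical, and the linear scan is the simplest of the two traversal options mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical repair event: "ownerId should drop its reference to targetId".
final class ReferenceCleared {
    final String ownerId;
    final String targetId;

    ReferenceCleared(String ownerId, String targetId) {
        this.ownerId = ownerId;
        this.targetId = targetId;
    }
}

final class IntegrityConsumer {
    // Reacts to an "entity removed" event: scans all entities linearly and
    // produces one repair event per dangling reference. Nothing is changed here.
    static List<ReferenceCleared> onEntityRemoved(String removedId,
                                                  Map<String, Set<String>> referencesByOwner) {
        List<ReferenceCleared> repairs = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : referencesByOwner.entrySet()) {
            if (entry.getValue().contains(removedId)) {
                repairs.add(new ReferenceCleared(entry.getKey(), removedId));
            }
        }
        return repairs;
    }

    // Applying the repair event is a separate step, like any other event.
    static void apply(ReferenceCleared repair, Map<String, Set<String>> referencesByOwner) {
        Set<String> references = referencesByOwner.get(repair.ownerId);
        if (references != null) {
            references.remove(repair.targetId);
        }
    }
}
```

Keeping detection and repair as separate steps means the fixes themselves end up in the event stream, where other consumers can see them.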

Conclusion

In Trond’s conclusion he puts a lot of emphasis on price. If solutions based on key-value-stores are cheaper than the current option, then that will drive adoption. I agree, but such a calculation must include all of the things mentioned above, such as less need for DBAs, less need for schemas, less mapping code, less waste in general, cheaper software, and so on. At the same time it also has to include the costs of handling many persistence technologies concurrently (which will be necessary in most cases). If you have an EventSourcing architecture, based on Qi4j, this cost should be minimal, but if you want to DIY then it might be quite considerable. An important point that Trond makes is that this model makes it much easier to have a domain model which you can evolve without disturbing other clients, as they will NOT be integrating with the application store as such, but will instead be working with the event stream.
