Modeling Data

A Simple Example Model

Let us take a simple example of modeling a User. A naive approach would start with defining a type representing the model for User. This can be done in a programming language such as Rust. Even someone without any programming language experience can understand the model below.


struct User {
    first_name: String, // Required
    last_name: String,  // Optional
    age: u8,            // Optional. Must be > 18
    email: String,      // Required. Must follow valid email format
}

Refining the Model

From a domain modeling perspective, let us consider a few improvements to the above model which are right now communicated using comments in the snippet.

Only first_name and email are required fields; the rest are optional.
All our users are aged above 18.
Not all Strings are valid email addresses. For starters, we want to guarantee they have the right email format (someone@somewhere.som).

Representing optional data

There are straightforward ways to incorporate optional information in most programming languages today. One can encode them in Rust as below.


struct User {
    first_name: String,
    last_name: Option<String>,
    age: Option<u8>, // Must be > 18
    email: String,   // Must follow valid email format
}
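As a small sketch of what this encoding buys us, any code that reads an optional field now has to handle the missing case explicitly. The helpers display_name and minimal_user below are hypothetical names introduced only for illustration; they are not part of the model itself.

```rust
// Same shape as the User model above.
struct User {
    first_name: String,
    last_name: Option<String>,
    age: Option<u8>, // Must be > 18
    email: String,   // Must follow valid email format
}

// Hypothetical helper: builds a display name and is forced by the
// compiler to say what happens when last_name is absent.
fn display_name(user: &User) -> String {
    match &user.last_name {
        Some(last) => format!("{} {}", user.first_name, last),
        None => user.first_name.clone(),
    }
}

// Hypothetical helper: a user who supplied only the required fields.
fn minimal_user(first_name: &str, email: &str) -> User {
    User {
        first_name: first_name.to_string(),
        last_name: None,
        age: None,
        email: email.to_string(),
    }
}
```

Note that the remaining constraints (age and email format) are still only comments at this point.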

Constraining the domain of values

Let us tackle the email format problem next. Essentially what we are saying is that a String is not the most appropriate representation of an email address. This is where we use the type system to help us. Let us define a new type for email, EmailAddress, to clearly model the above idea.


// New type representing an email address
struct EmailAddress(String);

struct User {
    first_name: String,
    last_name: Option<String>,
    age: Option<u8>,     // Must be > 18
    email: EmailAddress, // Must follow valid email format
}

This model is better. It is very clear that there is more to an email than just any plain String.

But that has not helped much with the definition of EmailAddress itself. It still seems to say the same, that it is just a String.

If you are a developer, you can think of several ways to code this up.

You can code up the whole thing in the constructor (or builder) for the EmailAddress type. This is completely opaque from a model perspective.
You can explicitly capture the regular expression representing an email to ensure that everyone can see it (in the spirit of transparency).
You can also make it clear to everyone that the constructor can fail by using an appropriate return type such as Result, which returns either a valid EmailAddress or an error.


// New type representing an email address
struct EmailAddress(String);

// Email address errors
enum EmailAddressError {
    InvalidFormat(String),
}

// Regular expression used to validate email addresses
// Trust me, we use it in the code below.
const EMAIL_REGEX: &str = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$";

impl EmailAddress {
    fn new(addr_string: String) -> Result<EmailAddress, EmailAddressError> {
        // Coming soon: The code for validating and constructing a new instance
        todo!()
    }
}

struct User {
    first_name: String,
    last_name: Option<String>,
    age: Option<u8>,     // Must be > 18
    email: EmailAddress, // Must follow valid email format
}

While this may be an improvement, there are several shortcomings.

Regular expressions (RegEx) are not exactly friendly (at least in my humble opinion). But they are a necessary evil and we may live with them. At least, for most common data types there are standard expressions that you can take from the web without fighting over their correctness at modeling time.
But as more and more validations are added, we are going to move more model aspects to code. What if we wanted to express that the email address should be less than 128 characters?
Things can easily get very complicated. What if the email address is to be validated by an external service or against a registered user database? In many schema based environments like GraphQL, they are addressed by field level resolvers. Now, both programmers and modelers have to learn a new modeling language to communicate with each other. We are not suggesting that it is wrong, but just pointing out how frictions arise in modeling even simple things. How implementation and environment details creep into the safe modeling space we were trying to create for ourselves!
Even with this simple example, one can see that the worlds of modeler and programmer are starting to diverge while their concerns remain entangled. As more model questions needing clarity emerge, more details seem to submerge in code. For all the slogans such as code is documentation and model is code that we embrace out there, we seem to have to resort to extraordinary measures to capture and communicate even the simplest of modeling requirements.
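To make the second point concrete, here is a minimal sketch of what adding the 128-character rule does to the constructor. Two things are assumptions made for this sketch: the TooLong variant and the MAX_EMAIL_LEN constant are hypothetical names, and has_email_shape is a deliberately simplified structural check that stands in for the regular expression so the example needs nothing beyond the standard library.

```rust
#[derive(Debug, PartialEq)]
struct EmailAddress(String);

#[derive(Debug, PartialEq)]
enum EmailAddressError {
    InvalidFormat(String),
    // Hypothetical variant added for the length rule; carries the actual length.
    TooLong(usize),
}

// The rule from the text: an address must be shorter than 128 characters.
const MAX_EMAIL_LEN: usize = 128;

impl EmailAddress {
    fn new(addr_string: String) -> Result<EmailAddress, EmailAddressError> {
        let len = addr_string.chars().count();
        if len >= MAX_EMAIL_LEN {
            return Err(EmailAddressError::TooLong(len));
        }
        if !has_email_shape(&addr_string) {
            return Err(EmailAddressError::InvalidFormat(addr_string));
        }
        Ok(EmailAddress(addr_string))
    }
}

// Simplified stand-in for the regular expression (an assumption of this
// sketch): exactly one '@', a non-empty local part, no whitespace, and a
// domain whose last '.' has text on both sides.
fn has_email_shape(addr: &str) -> bool {
    let mut parts = addr.split('@');
    let (local, domain) = match (parts.next(), parts.next(), parts.next()) {
        (Some(local), Some(domain), None) => (local, domain),
        _ => return false,
    };
    if local.is_empty() || addr.chars().any(char::is_whitespace) {
        return false;
    }
    match domain.rfind('.') {
        Some(dot) => dot > 0 && dot < domain.len() - 1,
        None => false,
    }
}
```

Both rules now live inside the body of new. Nothing in the type signature tells a modeler that a length limit exists, which is exactly the drift from model to code described above.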

To be fair, there are aspects of models and computations that ultimately have to be coded and that require specialized programming skills. But we should at least be able to separate them cleanly, keep them far away from the common concerns that impact diverse stakeholders, and defer them until design and implementation time.

Getting back to our example, a common solution is to provide appropriate constructors or builders for each type, so that their interfaces can represent the domain constraints better. Builders with combinators are a really powerful tool. The challenge is the boilerplate code that we end up developing and porting across different environments. We will definitely embrace builder patterns and combinators in our model. But we try to do that in a generic way without all the ceremony!
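For readers who have not seen one, a hand-written builder for User might look like the sketch below. The names UserBuilder and UserError are hypothetical, and email is kept as a plain String to keep the example short. Notice how much of it is repetitive setter code, which is the ceremony we would like to avoid.

```rust
#[derive(Debug, PartialEq)]
struct User {
    first_name: String,
    last_name: Option<String>,
    age: Option<u8>,
    email: String, // Plain String here only to keep the sketch short
}

// Hypothetical error type for the builder.
#[derive(Debug, PartialEq)]
enum UserError {
    MissingFirstName,
    MissingEmail,
    Underage(u8),
}

// Every field is optional while the value is being assembled.
#[derive(Default)]
struct UserBuilder {
    first_name: Option<String>,
    last_name: Option<String>,
    age: Option<u8>,
    email: Option<String>,
}

impl UserBuilder {
    fn new() -> Self {
        Self::default()
    }

    fn first_name(mut self, value: &str) -> Self {
        self.first_name = Some(value.to_string());
        self
    }

    fn last_name(mut self, value: &str) -> Self {
        self.last_name = Some(value.to_string());
        self
    }

    fn age(mut self, value: u8) -> Self {
        self.age = Some(value);
        self
    }

    fn email(mut self, value: &str) -> Self {
        self.email = Some(value.to_string());
        self
    }

    // The domain constraints are enforced only here, at build time.
    fn build(self) -> Result<User, UserError> {
        let first_name = self.first_name.ok_or(UserError::MissingFirstName)?;
        let email = self.email.ok_or(UserError::MissingEmail)?;
        if let Some(age) = self.age {
            // "Must be > 18" from the model comments.
            if age <= 18 {
                return Err(UserError::Underage(age));
            }
        }
        Ok(User {
            first_name,
            last_name: self.last_name,
            age: self.age,
            email,
        })
    }
}
```

A call such as UserBuilder::new().first_name("John").email("john@doe.com").build() reads well, but the required/optional split and the age rule are again visible only to someone who reads the body of build.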

There is an important point that we might be glossing over here. Constructors and builders deal with constructing instances of a given type. In our discussion, we are starting to talk about constructing types themselves. We have casually entered the meta realm.

Issues of state

Before exploring the solutions along the lines we discussed above, let us discuss a few more challenges with these approaches.

We just modelled a general user in our system and boldly proclaimed that certain fields are required and others are optional. This represents a programmer bias, instance bias, where we take an instance creation to also represent a single instant in time.

A typical instance creation is coded as shown below.


#![allow(unused)]

// Definitions repeated from the previous snippet
struct EmailAddress(String);

struct User {
    first_name: String,
    last_name: Option<String>,
    age: Option<u8>,     // Must be > 18
    email: EmailAddress, // Must follow valid email format
}

fn main() {
    let user = User {
        first_name: "John".to_string(),
        last_name: Some("Doe".to_string()),
        age: Some(42),
        email: EmailAddress("john@doe.com".to_string()),
    };
}

For a developer, the mental association of an instance creation is that of invoking a constructor, which happens instantly. This is not how things happen in the real world.

For instance (no pun intended), we may have a guest user we want to track who may not have any associated data in our system until after the user registers with the system. Once a guest user completes the registration process, the rest of the information becomes available.

Now, the simple, nice, clean data model we developed for our User is starting to unravel! This is the problem of state, which is very common in models of any kind. The available information (aka data) and the expected behaviors of a domain object depend on its state.

These scenarios are so ubiquitous, there are many tricks of the trade that we have developed over time. Over one third of the use cases for Business Process Management (BPM) systems involve handling some form of this synchronization of state with data representation.

Most solutions involve abandoning the required fields altogether and either adding fields to the same model or creating separate User models corresponding to different user states. The model considerations have left the model space, disappearing behind a thick fog of code and process diagrams.

The builder pattern does not address this modeling problem.
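One of the workarounds mentioned above, separate models for different user states, can be sketched with an enum that has one variant per state. This is only an illustration under our own assumptions: SessionId and the register transition are hypothetical names, and the set of states is limited to the guest and registered cases from the example.

```rust
#[derive(Debug, PartialEq)]
struct EmailAddress(String);

// Hypothetical identifier used to track a guest before registration.
#[derive(Debug, PartialEq)]
struct SessionId(u64);

// One variant per state; each state carries only the data it can have.
#[derive(Debug, PartialEq)]
enum User {
    Guest {
        session: SessionId,
    },
    Registered {
        first_name: String,
        last_name: Option<String>,
        age: Option<u8>,
        email: EmailAddress,
    },
}

impl User {
    // State transition: a guest becomes registered once the required data
    // arrives. An already registered user is returned unchanged.
    fn register(self, first_name: String, email: EmailAddress) -> User {
        match self {
            User::Guest { .. } => User::Registered {
                first_name,
                last_name: None,
                age: None,
                email,
            },
            registered => registered,
        }
    }

    // "Required" now depends on state: a guest has no email at all.
    fn email(&self) -> Option<&EmailAddress> {
        match self {
            User::Guest { .. } => None,
            User::Registered { email, .. } => Some(email),
        }
    }
}
```

The type is now honest about state, but every consumer has to match on the variants, and the rules about which transitions are allowed are once more buried in method bodies.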

In addition to the representation problem in the time dimension, there is a problem in the spatial dimension too. Different microservices that constitute our modern systems may have slightly different views of the model while referring to the same domain entity. Even in monolithic systems, there are representational differences across different layers or between frontend and backend subsystems.

For example, there may be a password field for User, which should not be available to any parts of the system, except the one responsible for user authentication perhaps.

A User may have an ID field that is required and non-null: not when the User originates from a web form for new user creation, but certainly on its return journey from the server and at all times thereafter.

What a User is depends upon who is asking and when in the application. The answer is context dependent! But there is still the essence of it all that is invariant in the system, a single culprit with different witness descriptions.
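In code, this context dependence usually shows up as several near-duplicate types for the same entity. The sketch below illustrates the password and ID examples; all type and function names (NewUserForm, UserCredentials, UserProfile, create_user) are hypothetical, and the password is stored as-is only to keep the example short, where a real system would store a hash.

```rust
// View 1: what the web form submits. No ID exists yet.
struct NewUserForm {
    first_name: String,
    email: String,
    password: String,
}

// View 2: what only the authentication component should see.
// Illustration only: a real system would keep a password hash here.
struct UserCredentials {
    id: u64,
    password: String,
}

// View 3: what the rest of the system sees. The ID is required and
// the password is absent.
#[derive(Debug, PartialEq)]
struct UserProfile {
    id: u64,
    first_name: String,
    email: String,
}

// Hypothetical creation step on the server: one incoming form is split
// into the two views that different parts of the system are allowed to use.
fn create_user(form: NewUserForm, next_id: u64) -> (UserProfile, UserCredentials) {
    let credentials = UserCredentials {
        id: next_id,
        password: form.password,
    };
    let profile = UserProfile {
        id: next_id,
        first_name: form.first_name,
        email: form.email,
    };
    (profile, credentials)
}
```

Three structs describe one domain entity, and nothing in the code states that they are views of the same User. That shared essence exists only in the heads of the developers.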

Relationships in Model

Relationships are part of life and models. They can be challenging to manage in both cases as well.

By representing email addresses as a different kind of model entity, we already introduced relationships into our model.

Every User has an email address

That seemingly innocent and straightforward sentence introduces a number of concepts relating to relations into our model.

Relatedness: Two entities are related to each other in some way.
Directionality: The relationship is from the User entity to EmailAddress. There is nothing in our model about the reverse relation from EmailAddress to User, at least in the model we developed so far.
Degree of Dependency: The User contains an EmailAddress. We may express the nature of the relationship between entities using different nomenclature depending on our professional upbringing. We strive to capture the degree and nature of these relationships in terms of association (weak form), aggregation, composition, containment etc. in some terminology.
Cardinality: Every User has one EmailAddress. Here it is expressed by a single, required field embedded in the User model. Incidentally, it does not exclude the possibility of multiple users having the same email, which we probably intended to rule out; that requires a uniqueness constraint. We have definitely ruled out the possibility of a User having multiple EmailAddresses, whether that was our intention or not. As you can see, we want to be able to reason about one-to-one, one-to-many and many-to-many relations between model elements or entities.
Lifecycle: There are questions about the life of related entities in our model. Is an EmailAddress valid even after the associated User is removed from the system? What happens when an EmailAddress is updated, invalidated or removed? Should that be allowed outside the context of a User? What if we were tracking different communications a User had in our system which are linked to email?
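The cardinality point can be illustrated with a short sketch. Changing the field from a single EmailAddress to a Vec turns the relation into one-to-many, while the uniqueness constraint cannot be expressed on the User type at all and ends up in a separate component. UserRegistry and RegistryError are hypothetical names introduced for this illustration.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct EmailAddress(String);

// One-to-many: a Vec instead of a single field lets a User own
// several addresses.
struct User {
    first_name: String,
    emails: Vec<EmailAddress>,
}

#[derive(Debug, PartialEq)]
enum RegistryError {
    EmailAlreadyInUse(EmailAddress),
}

// Hypothetical component that owns the uniqueness constraint. The
// constraint spans all users, so it cannot live inside the User type.
#[derive(Default)]
struct UserRegistry {
    users: Vec<User>,
    used_emails: HashSet<EmailAddress>,
}

impl UserRegistry {
    fn add(&mut self, user: User) -> Result<(), RegistryError> {
        // Check every address first so a rejected user leaves the
        // registry unchanged.
        if let Some(taken) = user.emails.iter().find(|e| self.used_emails.contains(*e)) {
            return Err(RegistryError::EmailAlreadyInUse(taken.clone()));
        }
        self.used_emails.extend(user.emails.iter().cloned());
        self.users.push(user);
        Ok(())
    }
}
```

The directionality, dependency and lifecycle questions remain unanswered by this code: nothing here says what should happen to used_emails when a User is removed.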

Such modeling concepts and their primitives are prevalent in the database community. They are also very much part of formal modeling systems such as UML. But they are mostly incorporated into programming interfaces through boilerplate code. In many cases, this means that we may express the relationship inconsistently across different parts of the system. The model of User for the database would contain this constraint, but the Application Programming Interface (API) would not! Shouldn't it be possible for all stakeholders and all systems to have the same understanding of model constraints without worrying about where and how they are implemented?

The moral of the story here is that modeling decisions in the presence of complex relations and their representation and communication is hard, even without the complexities of globally distributed, heterogeneous environments where modular subsystems can independently evolve.

Models of Data

Crossing the chasm

Data modeling has a long and storied history. It is a rich and well-established field, well supported by a thriving database community. Databases and their schemas have been powering most of the systems out there. When we refer to a model in an application context (the Model, View, Controller (MVC) pattern, for example), we are most often referring to the data, usually stored in some database. Yet there seems to be a divide between how data is modeled in the data layer and in the computation (or application logic) layer. One focuses more on data at rest while the other is concerned with data in motion (transition). But both have to deal with the dynamics of the system (the changing states).

The data community does this by shoving more status fields into their tables and documents, while the application community (both client and server side) deals with them by writing a truckload of code in the name of controllers and logical blocks. The very essence of our system lives in the wild wild west of boilerplate code that is built to tie these disparate worlds together.

Can we have a unified model of a domain that crosses these boundaries of client, server, middleware and databases? Can we do this without all the ceremony and fanfare?

The quote below summarizes our trials and tribulations with modeling so far.

"Domain entities are modal and dynamic, while our data models are static."

This is what we attempt to change. Before we can address the above challenges, let us also look at the challenges with adding behaviors to our model.

Metals - Meta Programming Language System