Techniques for configurable python code
A Machine Learning motivated odyssey
This post goes down a dangerous path. Since writing it, I've experienced more real-life situations and my perspective is different.
I now believe that the best way to write configurable code is not to use config files, but to use code. Every config language ends up being just a broken programming language. It's better in almost every way to use a battle-tested language rather than a DSL that nobody knows. Django does it; AWS does it with CDK (IaC in TypeScript); etc.
See this interesting post: The configuration complexity clock
TL;DR: don't do anything fancy, use code. Or maybe keep it super simple with a flat structure (only key-values)...
As a Deep Learning Engineer, I’ve recently been thinking about clean ways to organize code and define pipelines. Here is an attempt to summarize my learnings.
- Introduction
- A simple example
- A need for configurable code
- Modularizing your code
- Using the modularized code
- A more complicated example
- A Machine Learning Perspective
- Conclusion
Introduction
Let’s start with the facts: writing code for a side-project or a class-project is very different from writing code for a large organization. Among other things, you face the following requirements:
- collaboration: it’s not only your code, so you need to make sure that the components you add play nicely with your teammates’
- compatibility over time: not only does your new feature need to work, it also has to integrate nicely with the existing logic, without breaking everything. What’s more, you want to avoid having to come back and rework your code in the future.
- flexibility in usage: you need to keep in mind that specifications are likely to evolve in the future, and you need to plan accordingly (as much as possible without overdoing it), as you might need to support a wider variety of use cases. Ideally, the design itself should be modular enough so that adding new functionality is easy.
While the previous points may seem obvious, addressing them successfully requires patience and experience, especially if you write python code: since it is a scripting language by nature, it’s easy to forget good software-engineering practices and dive right in. After all, running code is better than nice-but-unusable code.
A simple example
Let’s take a simple example: we want to compute 2 * x + 1.
If I want to quickly put something together, I can just write a short script that does the job
def f(x):
    return 2 * x + 1

f(2)
Of course this is a dummy example, but you can extrapolate, until you reach the complexity of an actual program. Once your initial script becomes too long, a natural thing to do is to create modules and helper functions, in an attempt to improve code reuse, effectively performing semantic compression. Now you have a script disguised as a “library”. And this is perfectly fine, if it’s the first iteration of a project, or if you only need to support one use case.
A need for configurable code
One day, the project manager comes to see you, asking to support a new use case: 2 * (2 * x + 1).
That’s easy, let’s do something like
def f(x, use_case):
    if use_case == "use_case_1":
        return 2 * x + 1
    elif use_case == "use_case_2":
        return 2 * (2 * x + 1)
    else:
        raise ValueError("Use case not supported.")

f(2, "use_case_2")
In real life, this means passing combinations of arguments to helper functions, resulting in something quite complicated to maintain. As different use cases keep coming, the number of if ... else ... statements increases, reaching an unhealthy ratio. Soon, the combination of options forces parts of your code to support a combinatorial number of possibilities. If you have 2 main options with 10 possibilities each, and each combination requires some custom logic, that’s 10 × 10 = 100 possibilities! Chances are that in a part of the code you are less familiar with, a specific combination of options causes a failure. Hopefully you follow the guidelines of test-driven development and such a liability will be exposed before any release.
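To make the combinatorial problem concrete, here is a hypothetical sketch (the operation and mode options are made up for illustration) of what the dispatch code tends to become:

def f(x, operation, mode):
    # Every (operation, mode) pair needs its own branch, so the number of
    # branches grows with the product of the option counts.
    if operation == "times_two" and mode == "single":
        return 2 * x
    elif operation == "times_two" and mode == "twice":
        return 2 * (2 * x)
    elif operation == "plus_one" and mode == "single":
        return x + 1
    elif operation == "plus_one" and mode == "twice":
        return (x + 1) + 1
    else:
        raise ValueError("Combination not supported.")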
Modularizing your code
After a while, the expectations become more generic and you’re required to support “all combinations of 2 * x and x + 1”. Worse, you foresee a near future where other operations will have to be supported, like 3 * x. You should probably isolate each of these operations from the others, as well as the way you combine them together.
After some time spent rewriting parts of the code to make it more modular, you come to the conclusion that each of these operations is independent from the others and that combining them together is a separate issue (Separation of Concerns). Each operation should be responsible for one thing and one thing only (Single Responsibility Principle), while following the same contract.
In python, one of the right ways of doing this is to define an “interface” for your operations (let’s call them layers), using an abstract class
from abc import ABC, abstractmethod

class BaseLayer(ABC):
    @abstractmethod
    def forward(self, x):
        raise NotImplementedError()
and implement different versions of that base class
class PlusOneLayer(BaseLayer):
    def forward(self, x):
        return x + 1

class TimesTwoLayer(BaseLayer):
    def forward(self, x):
        return 2 * x
Finally, chaining the layers is the job of some other class
class Model:
    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x
In this example we used object-oriented programming to modularize the code, but other options are also possible (using functions with similar signatures, for example). However, the nice thing about OOP is that it lets you define contracts. Once implemented, your editor will help you find errors, which might speed up the whole development process while improving code robustness.
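For completeness, here is a minimal sketch of that functional alternative, where layers are plain functions sharing the same signature and the model is just their composition:

def times_two(x):
    return 2 * x

def plus_one(x):
    return x + 1

def make_model(layers):
    # Compose the layer functions in order
    def forward(x):
        for layer in layers:
            x = layer(x)
        return x
    return forward

model = make_model([times_two, plus_one])
model(2)  # returns 5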
Using the modularized code
Okay so now we have abstracted and isolated the different components of the program. How do we use it?
With python scripts outside the library
The easiest way is to let the different users of your library define one script per pipeline. For each of the previous use cases, we end up with a script
# use-case-one.py
times_two = TimesTwoLayer()
plus_one = PlusOneLayer()
model = Model([times_two, plus_one])
model.forward(2)
and
# use-case-two.py
times_two = TimesTwoLayer()
plus_one = PlusOneLayer()
model = Model([times_two, plus_one, times_two])
model.forward(2)
At this point, it may look like we haven’t made a lot of progress. It turns out that in the process of making our code modular and reusable in a nice and abstract way
- we created a library (a collection of tools that can be easily re-used), in other words, we created a high-level API that users can build on (see how to design a good API and why it matters)
- we separated the layers’ implementation (don’t forget that this is a dummy example but the actual operations you are implementing are much more complicated) from usage. Actually, each of these scripts can be seen as a special configuration.
It is interesting to notice that the workflow manager Airflow lets you define pipelines using a python interface (Directed Acyclic Graphs, Operators, etc.). From the documentation: “One thing to wrap your head around (it may not be very intuitive for everyone at first) is that [Airflow Python scripts are] really just configuration files specifying DAG’s structure as code”
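As an illustration only (this is not part of our toy library), here is a minimal sketch of such a “configuration as code” Airflow file; the exact import paths depend on your Airflow version:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# The python file only declares the DAG's structure; Airflow's scheduler runs it.
with DAG(dag_id="example_pipeline", start_date=datetime(2020, 1, 1), schedule_interval=None) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    train = BashOperator(task_id="train", bash_command="echo train")
    extract >> train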
While this may sound obvious, it is crucial to separate implementation from usage, especially because it’s so easy to mix the two and end up with a library that is part script-like and usage-specific, side-by-side with a collection of helper functions that may have otherwise been reusable for a wider variety of use cases.
With config files
While python files are probably sufficient in most cases (and this option should probably always remain available, because pipeline creators are likely to be programmers, and who knows what they will have in mind), some situations might benefit from a more convenient pipeline definition format. Advantages and requirements may include
- Avoid duplication by splitting configs into sub-configs.
- Use a format that can easily be shared.
- Provide a way for non-programmers to define their own pipelines.
- Provide a lightweight, less-verbose way of defining pipelines.
There are a number of good formats that are widely adopted in the python community
- .json (JavaScript Object Notation), probably the most popular format, as it resembles python dictionaries.
- .jsonnet, built on top of json, adds support for imports, variable definition and much more, before “compilation” to a standard .json.
- .ini (see configparser)
- .yaml
- .xml
Having said that, the question becomes: what do we write in these configuration files, and how do we reload them?
Usually, the first step would be to implement a way to create an object from a python dictionary. There are multiple ways of doing it
- define a Serializable interface and have each class implement a class method from_params(cls, params) that creates an object from a dictionary.

  class Serializable(ABC):
      @classmethod
      @abstractmethod
      def from_params(cls, params):
          raise NotImplementedError()

  For example

  class Model(Serializable):
      @classmethod
      def from_params(cls, params):
          transforms = []
          for transform_name in params["transforms"]:
              if transform_name == "times_two":
                  transforms.append(TimesTwoLayer())
              elif transform_name == "plus_one":
                  transforms.append(PlusOneLayer())
              else:
                  raise ValueError()
          return Model(transforms)

  This is basically what the FromParams class does in the AllenNLP library.
- use a Schema approach. In other words, delegate the creation of objects from dictionaries to another class. This is probably the most common approach, but might be overkill in some cases. Have a look at the great marshmallow library.
Another tip: you might want to validate and normalize the dictionaries before creating instances from them. This can be useful to check for missing entries, fill in default values, rename parameters, etc. I’ve been using cerberus for that purpose.
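As a quick sketch, here is what such a validation step could look like with cerberus (the schema below is a hypothetical one for our "transforms" config, not something prescribed by the library):

from cerberus import Validator

schema = {
    "transforms": {
        "type": "list",
        "schema": {"type": "string", "allowed": ["times_two", "plus_one"]},
        "default": ["times_two", "plus_one"],
    }
}

validator = Validator(schema)
params = {"transforms": ["times_two", "unknown"]}
if not validator.validate(params):
    # validator.errors describes what went wrong (here, the "unknown" transform)
    raise ValueError(validator.errors)
params = validator.normalized(params)  # fills in defaults before building objects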
Now, our different use cases can be defined in simple .json files
{
    "transforms": ["times_two", "plus_one"]
}
and
{
    "transforms": ["times_two", "plus_one", "times_two"]
}
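To answer the “how do we reload them?” question, a minimal sketch (the file name is hypothetical) combines the standard json module with the from_params method defined above:

import json

with open("use_case_two.json") as f:
    params = json.load(f)

model = Model.from_params(params)
model.forward(2)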
Sharing and editing different pipelines is now even easier! In a way, our json
syntax is some kind of small “programming language” that lets us interface with our library in a minimalistic and convenient way.
A more complicated example
In the previous example, things were simple. We had very few classes, with reasonable dependencies. There were not a lot of parameters, object nesting, etc.
Let’s take a slightly more complicated example.
Let’s require each Layer to define a name attribute (this illustrates that dependencies usually have their own parameters), as well as to depend on a Vocab instance that will be shared among layers (this illustrates the need to support arbitrary hierarchies of dependencies).
In other words, we modify the code in the following way
class Vocab:
    def __init__(self, words):
        self.words = words

class BaseLayer(ABC):
    def __init__(self, name, vocab):
        self.name = name
        self.vocab = vocab
Technically, in this example, the layers won’t use the vocab as it only represents some dependency they might have. In NLP, having a single vocabulary used in various places is very common, hence this example.
Now, defining our pipeline in python is still straightforward (and that’s why the first step towards configuration is to use plain python)
vocab = Vocab(["foo", "bar"])
times_two = TimesTwoLayer("times_two", vocab)
plus_one = PlusOneLayer("plus_one", vocab)
model = Model([times_two, plus_one])
model.forward(2)
But what about our nice json format? If we adopt a reverse-engineering approach, we can sketch what it could look like.
{
    "transforms": [
        {
            "type": "TimesTwoLayer",
            "params": {
                "name": "times_two",
                "vocab": ["foo", "bar"]
            }
        },
        {
            "type": "PlusOneLayer",
            "params": {
                "name": "plus_one",
                "vocab": ["foo", "bar"]
            }
        }
    ]
}
The config file now contains almost all the necessary information. We can infer the Layer
classes using the "type"
entry, and use the "params"
to create instances of those classes. Let’s do it for the sake of completeness
class BaseLayer(Serializable):
    @classmethod
    def from_params(cls, params):
        return cls(params["name"], Vocab(params["vocab"]))

class Model(BaseLayer, Serializable):
    @classmethod
    def from_params(cls, params):
        transforms = []
        for d in params["transforms"]:
            if d["type"] == "TimesTwoLayer":
                transforms.append(TimesTwoLayer.from_params(d["params"]))
            elif d["type"] == "PlusOneLayer":
                transforms.append(PlusOneLayer.from_params(d["params"]))
            else:
                raise ValueError()
        return Model(transforms)
There are ways to improve the whole logic, for example we might use inspection to automatically resolve the class from its name or import string, or make the Vocab class also Serializable.
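For example, a minimal sketch of resolving a class from an import string (the "libnn.layers" module path is hypothetical) could rely on importlib:

import importlib

def resolve(import_string):
    # "libnn.layers.TimesTwoLayer" -> the TimesTwoLayer class object
    module_name, class_name = import_string.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)

With such a helper, from_params no longer needs one branch per class, provided the "type" entries in the config contain full import strings.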
It seems that we have achieved our goal, haven’t we?
Actually, there is an issue with the way the vocabulary is created: we created two identical yet distinct instances of the same vocabulary, while what we want is to share the same object between the transforms (generally, some dependencies might be resource intensive and you want to avoid wasting resources).
This is almost a singleton kind of situation (almost, because we might have other vocabs elsewhere, it just turns out that these transforms need to share this one instance) and we can expect this kind of dependency to come up in different places.
We could modify our json schema to capture this information, update our from_params method, add some convention for object reuse and singletons, etc., but this goes beyond the scope of this post.
Configuration as Dependency Injection
This whole configuration process is actually a Dependency Injection problem.
What Dependency Injection means
We want to create a pipeline made of components that hierarchically depend on each other. We need a way to create dependencies and inject them into the objects that depend on them at creation time.
In our example, first we need to create the Vocab, then create the different Layers, “inject” the vocab dependency at creation time, and finally provide the layers when creating the Model.
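To make the idea concrete, here is a minimal sketch (the class names are made up for illustration) contrasting a layer that builds its own dependency with one that receives it through the constructor:

# Without injection: the layer builds its own Vocab, so changing the
# vocabulary means changing the class itself.
class HardcodedLayer:
    def __init__(self, name):
        self.name = name
        self.vocab = Vocab(["foo", "bar"])

# With injection: the Vocab is built elsewhere and passed in, so the same
# class works with any vocabulary, and a single instance can be shared.
class InjectedLayer:
    def __init__(self, name, vocab):
        self.name = name
        self.vocab = vocab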
There are multiple ways of effectively implementing dependency injection. Our from_params
approach, though imperfect, could be improved to a state where it supports singletons, scoping etc.
Dependency Injection using Registries and Assemblers
In our example, the complexity stems from the multiple dependencies, and the fact that some objects are shared (the Vocab is the same for all our Layers).
A way to deal with a complex dependency pattern is to change the code and delegate the injection to specialized classes. For example, in the Vocab
case, we can create a VocabRegistry
in charge of providing the objects by name.
class VocabRegistry:
    VOCABS = dict()

    @staticmethod
    def get(name):
        return VocabRegistry.VOCABS[name]
and update the BaseLayer
into
class BaseLayer(ABC, Serializable):
    def __init__(self, name, vocab_name):
        self.name = name
        self.vocab = VocabRegistry.get(vocab_name)
That way, the objects can be created independently, the “injection” being the Registry’s responsibility.
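A minimal usage sketch (the "default" key is an arbitrary name chosen for this example):

# Register the shared vocab once
VocabRegistry.VOCABS["default"] = Vocab(["foo", "bar"])

# Layers now only need the vocab's name, not the object itself
times_two = TimesTwoLayer("times_two", "default")
plus_one = PlusOneLayer("plus_one", "default")
model = Model([times_two, plus_one])
model.forward(2)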
Another (close) way of resolving the complex dependencies is to have an assembler that puts everything together
from typing import Dict

class ModelAssembler:
    @staticmethod
    def from_dict(data: Dict) -> Model:
        # Create the shared vocab
        vocab = Vocab(data["vocab"])
        # Create the layers, injecting the vocab
        layers = []
        for layer_data in data["layers"]:
            layer_type = layer_data["layer_type"]
            layer_name = layer_data["layer_name"]
            if layer_type == "TimesTwoLayer":
                layer = TimesTwoLayer(layer_name, vocab)
            elif layer_type == "PlusOneLayer":
                layer = PlusOneLayer(layer_name, vocab)
            else:
                raise ValueError(f"{layer_type} not recognized.")
            layers.append(layer)
        return Model(layers)
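As a usage sketch, the assembler only needs a plain dictionary (written inline here; it could just as well come from a .json file):

model = ModelAssembler.from_dict({
    "vocab": ["foo", "bar"],
    "layers": [
        {"layer_type": "TimesTwoLayer", "layer_name": "times_two"},
        {"layer_type": "PlusOneLayer", "layer_name": "plus_one"},
    ],
})
model.forward(2)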
One of the issues of this approach is that it requires you to define assemblers or registries for each of your pipeline dependencies. While in many cases this won't be too much of a problem, it can be limiting and prevent innovative uses of your library.
Dependency Injection using gin-config
The above solution requires a rather counter-intuitive redesign of the code, while increasing complexity to the detriment of readability. In some cases, this is fine, especially if intricate dependencies are not that frequent in the code.
In general, we would like to stick to simple abstractions, with dependencies being injected through the constructor (the __init__
method). This is at the same time more pythonic and easier to read.
It turns out that there exist nice solutions for dependency injection through config files in python. One of the best tools I’ve seen is gin-config, a dependency injection package for python built by Google.
If we were using gin, the only thing we would need to do is define a .gin configuration file that sets all the dependencies in a nice, lightweight, and composable syntax, after annotating all the classes with the special decorator @gin.configurable, which allows you to tweak and define the dependencies (here our example is simple enough that we don’t need to customize the decorator, but in some cases you might want to rename dependencies, provide defaults, require some dependencies to be defined even though the __init__ method has a default, etc.)
import gin

@gin.configurable
class Vocab:
    pass

@gin.configurable
class TimesTwoLayer:
    pass

@gin.configurable
class PlusOneLayer:
    pass

@gin.configurable
class Model:
    pass
Here is what the .gin
config file would look like in our case
import libnn.layers
import libnn.model
import libnn.vocab
# Vocab Singleton
# =====================================================================
Vocab.words = ["foo", "bar"]
vocab/singleton.constructor = @Vocab
# Layers Scopes
# =====================================================================
layer1/TimesTwoLayer.name = "times_two"
layer1/TimesTwoLayer.vocab = @vocab/singleton()
layer2/PlusOneLayer.name = "plus_one"
layer2/PlusOneLayer.vocab = @vocab/singleton()
# Model
# =====================================================================
Model.layers = [@layer1/TimesTwoLayer(), @layer2/PlusOneLayer()]
Gin’s philosophy is a little confusing at first, as it defines dependencies at the class level. If you need different objects of the same class, you need to scope the dependencies.
Gin supports a lot of functionality that you will probably need
- scoping (define dependencies between object types in different scopes).
- import (compose different gin files into one big pipeline).
- singletons (define one object that will be reused, in our example we need it for the shared vocabulary).
- Tensorflow and PyTorch specific functionality
Parsing a gin-file in python is straightforward
import gin
gin.parse_config_file("config.gin")
model = Model() # Argument `layers` provided by gin
The only caveat with gin is that it blurs the line between code and configuration: the gin syntax is really close to python. Also, because gin takes care of dependencies for you, it is counter-intuitive at first. However, if you take some time to think about it, gin really provides something that python does not: an easy, linear way to define dependencies, i.e. declare how classes are supposed to be put together, and delegate the injection to gin.
With python's super() method
Not much to say here, as everything is explained by Raymond Hettinger in his “Super considered super!” PyCon 2015 talk.
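For completeness, here is a minimal sketch of the cooperative pattern described in that talk (a simplified adaptation, not a verbatim excerpt), where each class consumes its own keyword arguments and forwards the rest up the MRO:

class Base:
    def __init__(self, **kwargs):
        super().__init__()  # end of the cooperative chain

class WithVocab(Base):
    def __init__(self, vocab=None, **kwargs):
        self.vocab = vocab
        super().__init__(**kwargs)

class WithName(Base):
    def __init__(self, name=None, **kwargs):
        self.name = name
        super().__init__(**kwargs)

class ConfigurableLayer(WithName, WithVocab):
    pass

layer = ConfigurableLayer(name="times_two", vocab=Vocab(["foo", "bar"]))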
A Machine Learning Perspective
In Machine Learning, and especially in Deep Learning, the code is usually built around the following abstractions
- dataset
- preprocessing
- layers
- models
- optimizers
- initializers
- losses
Defining a model is usually just chaining layers, defining a loss and an optimizer, and combining everything into one function call. More importantly, since it is an empirical field, being able to quickly test different configurations is key.
Interestingly, it seems that the relatively young world of Deep Learning libraries is converging to a common approach.
- the ability to define pipelines in simple .jsonnet files provided by AllenNLP was key to its success. The implementation of this feature is actually very close to the from_params approach covered in the example of this article.
- similarly, the team behind SpaCy released a great package to help train neural networks for NLP on top of Tensorflow, PyTorch and JAX: thinc. It builds on top of .ini configuration files, in a similar manner to gin.
- more recently, as we see a burst of new packages (Trax, Flax, Haiku) in the Deep Learning world motivated by the growth in popularity of JAX (NumPy accelerated on GPU, with higher-order differentiation, proper control over random generators, XLA support and a few other tricks), I was glad to see that Google Brain's Trax was actually using gin as a configuration language.
Conclusion
Here are the main takeaways
- Python and Machine Learning are not an excuse to build poorly designed libraries. We should read about design patterns, and try to follow the Single Responsibility and Separation of Concerns principles. It will make the code more modular.
- Separate implementation from usage, i.e. build libraries that allow users to do complex things with little code.
- Usage is equivalent to configuration, and configuration boils down to dependency injection.
- While simple python scripts are a great way to define configs, you might want to build some custom .json (or any other format that suits your needs) interface for ease-of-use, or switch to gin (or any other dependency injection package) for out-of-the-box functionality.