
Dependency resolution in packaging systems is, in general, a messy subject.

As a rule of thumb, it's good if libraries specify their dependencies as loosely as possible, so that the downstream libraries or applications that use them have flexibility to choose a version that will work with other dependencies.

Applications, however, probably want to specify their dependencies as strictly as possible, to ensure build reproducibility.

But this leads to an interesting constraint-solving problem. You have a set of dependencies for your application, and for each dependency you have a set of constraints. If the problem is over-constrained (e.g. library A requires library C version 1.0, while library B requires library C version 2.0), dependency resolution fails! If the problem is under-constrained, you have many valid versions to consider. And each of those versions may bring dependencies of its own that need to fit with the rest of the application's dependencies.
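
To make the two failure modes a bit more concrete, here is a minimal sketch using the packaging library from PyPA (pip installable as "packaging"); the libraries and constraints are invented purely for illustration:

from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Hypothetical constraints on "library C" coming from two different dependents.
from_a = SpecifierSet("==1.0")   # library A wants C 1.0
from_b = SpecifierSet("==2.0")   # library B wants C 2.0
candidates = [Version("1.0"), Version("1.5"), Version("2.0")]

# Over-constrained: no candidate satisfies both specifiers at once -> resolution fails.
print([str(v) for v in candidates if v in (from_a & from_b)])   # []

# Under-constrained: a loose specifier leaves many candidates for the solver to explore.
loose = SpecifierSet(">=1.0,<3.0")
print([str(v) for v in loose.filter(candidates)])               # ['1.0', '1.5', '2.0']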

To compound this problem, the dependency hierarchy forms a tree, and while all the dependencies must fit together, you're "discovering" the full set of dependencies only during resolution.

Dependency resolution is a hard problem; in fact, it is NP-complete.

So the thing barely works as-is, and any delay on the hot path has the potential to make resolution very slow for any sizeable project.

I ran into an example of this recently with Poetry. It is pretty well documented in their FAQ.

Quote from Poetry FAQ:

This is due to the fact that not all libraries on PyPI have properly declared their metadata and, as such, they are not available via the PyPI JSON API. At this point, Poetry has no choice but to download the packages and inspect them to get the necessary information. This is an expensive operation, both in bandwidth and time, which is why it seems this is a long process.

The basic setup of the problem is:

  1. At the start of dependency resolution, not all dependencies are known yet
  2. Some Python packages on PyPI don't fully declare their dependencies in the metadata

This means that as the solver descends the dependency tree, it needs to download all of those packages to inspect their dependencies, and this can be very slow.
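
As a rough sketch of what the resolver is up against, you can ask PyPI's JSON API which dependencies a single release declares; when requires_dist comes back empty or missing, a resolver has no choice but to download and inspect the artifact itself. The boto3 version below is simply the one pinned later in this post:

import json
import urllib.request

def declared_dependencies(name, version):
    # Query PyPI's per-release JSON API and return whatever dependency
    # metadata the release declares (None if nothing is declared).
    url = f"https://pypi.org/pypi/{name}/{version}/json"
    with urllib.request.urlopen(url) as response:
        info = json.load(response)["info"]
    return info.get("requires_dist")

# Prints the declared requirements, e.g. constraints on botocore and s3transfer.
print(declared_dependencies("boto3", "1.26.137"))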

An example with specific libraries:

  1. Specify redshift-connector as a dependency in your Python project
  2. The redshift-connector specifies its dependency on boto3 as boto3>=1.9.201,<2.0.0. This is hundreds of valid versions, since 1.9.201 was released in August 2019, and boto3 is released almost daily.
  3. Poetry proceeds to download all of these versions to check them for their dependencies (I stopped it after 3h of doing this)
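
Out of curiosity, here is a rough way to count how many boto3 releases fall inside that constraint, again via the PyPI JSON API (pre-releases and yanked releases are not filtered out, so treat the number as an estimate):

import json
import urllib.request

from packaging.specifiers import SpecifierSet
from packaging.version import InvalidVersion, Version

# The constraint redshift-connector places on boto3.
spec = SpecifierSet(">=1.9.201,<2.0.0")

with urllib.request.urlopen("https://pypi.org/pypi/boto3/json") as response:
    releases = json.load(response)["releases"]

matching = []
for release in releases:
    try:
        version = Version(release)
    except InvalidVersion:
        continue
    if version in spec:
        matching.append(version)

print(f"{len(matching)} boto3 releases satisfy {spec}")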

Thankfully the solution is also mentioned in their FAQ:

At the moment there is no way around it. However, if you notice that Poetry is downloading many versions of a single package, you can lessen the workload by constraining that one package in your pyproject.toml more narrowly. That way Poetry does not have to sift through so many versions of it, which may speed up the locking process considerably in some cases.

So in our specific example, the solution is to add boto3 as an explicit dependency pinned to a single version; this tells Poetry early on that there is only one version to try solving with.

To identify which (transitive) dependency is causing problems, you can run e.g.

poetry update -vvv

Once you identify the dependency, add it explicitly to your pyproject.toml:

[tool.poetry.dependencies]
boto3 = "1.26.137"