underlap

Writing better software design documents

April 26, 2024

How does anybody get good in writing software design docs? The more I try, the less I feel confident 🙄

To which I replied:

Find and read good design docs.
Ask people to provide feedback on the design docs you write.
Search the web for checklists of topics to include.
Add your own topics relevant to the area you work in.
Work on your general writing skills, e.g. read “The Elements of Style” by Strunk and White.
If you understand an aspect of the design clearly, write it up so others can understand it too. If you don't understand an aspect, write down a question about it and try to find the answer.

What makes a good design document?

I think the most important purpose of a good design document is to define terms clearly and describe the mental model of how the software fits together. Capturing rationale and alternatives considered is like gold dust and a great gift to one's future self. (It's really hard to remember that kind of stuff years later.)

Writing introductory material and defining terms early on has the advantage that one's viewpoint is closer to that of a beginner. That doesn't last long! Also, it's only when you write something down that you realise you don't understand it quite as well as you thought.

Resisting the urge to start coding until you've got a reasonably clear understanding of what's needed not only saves time overall, but captures information useful to others.

Someone with no previous exposure to the piece of software should be able to understand the terminology and start to find their way around the code. The design doc might not get them immediately into the code, but they should be able to start browsing and searching the code and find some of the terms introduced in the design doc so they can begin to grok the code.

Generally, over the years my design docs have become shorter. I don't include detail which should be code or code comments as that will tend to go out of date quickly.

Topics to cover

The topics to cover in a design document vary, depending on the project, but here are some examples of the kinds of headings you might see:

Requirements
Scope
Out of scope
Architecture
Interfaces
Major dependencies
Operating systems supported
Hardware platforms supported
Configuration
Security
Performance
Compatibility
Interoperabiity
Internationalisation
Alternatives considered

Don't design documents go out of date?

I think there's some value in keeping old architecture or design docs without necessarily keeping them up to date. The original rationale is valuable information and doesn't tend to go out of date. They just need to be treated as historical documents.

There is some value in capturing tests as a way of documenting the design that doesn't go out of date. Tests, at least partially, answer the question “What does this code do?”, but they don't necessarily help answer the question “Why was this code written?”, something that a design document can describe and which will continue to remain true because it's a description of the past.

Example design documents

The following are some examples of hopefully good design documents. These are mostly from projects I've been directly involved in.

Each example links to the design document and provides a graphic to give a flavour of what's there.

If you know of excellent examples I could add, please let me know.

A note on Z specifications

Please note that some of the following design documents are Z specifications. These contain a mixture of English, which should be fairly readable on its own, and formal, mathematical models that describe the state and operations of parts of the system being modelled.

The Z specifications were intended for anyone with sufficient mathematics to follow them, but primarily for myself as an aid to understanding. Each Z specification includes an introduction to the Z notation as an appendix, which should be sufficient to give anyone with sufficient mathematics a reasonable clue as to what is going on.

1. Diagrams: Eclipse Virgo concepts

I wrote Virgo Concepts some years back. As well as introducing the main concepts behind the Virgo modular application server, the page serves as the introduction to some other design docs about Eclipse Virgo.

The page was intended to be read by (potential) contributors to Virgo.

2. Apache Tomcat 9 Architecture

The Apache Tomcat 9 Architecture site defines key terms in an overview and then describes server startup and request process flow using UML sequence diagrams.

3. Image Relocation: README

The design docs here consist of a README as well as some Z specifications (see below).

The README was intended for users of the image relocation library.

4. Image Relocation: formal specifications

I wrote these specs, in a mixture of English and the Z specification language, as a way of getting my head round Docker/OCI images and registries.

(Click here to enlarge.)

5. UML: jvmkill Developers' Guide

The jvmkill Developers' Guide included a sequence diagram to explain the interaction between the Java Virtual Machine and the jvmkill agent.

6. An Abstract Model of Linux Control Groups and Subsystems

Steve Powell and I wrote this formal specification of cgroups and subsystems when we were trying to understand those Linux features when preparing to work on the garden project.

Here's a snippet showing one of the Z schemas:

[ | about | blogroll | contact | now | search | subscribe | © | ]

Why write unit tests? A response

April 23, 2024

I really enjoyed Henrik Jernevad's post Why write unit tests? and agree with most of what he wrote. But I'd like to respond briefly⁰ to some of the points raised.

Writing tests to drive design (TDD)

I really like the “test first” aspect of TDD (see my post on this topic), although I don't practise it strictly, especially on personal projects. But the principle of writing (unit) tests early means that my code ends up being testable rather than untestable — definitely a good thing.

If you really want to get into TDD, a pretty good introductory book is “Growing Object-Oriented Software Guided by Tests” by Steve Freeman and Nat Pryce.

That said, after reading the book, I was a little disappointed. I've spent a lot of time over my career designing¹ software, especially the interfaces between components, and the book didn't really make me better at doing that. TDD avoids certain kinds of design error, but not the kind I am accustomed to making.

Rich Hickey put it this way in one of his talks². He compared tests to crash barriers (or “guard rails”, since he's American) on motorways (“freeways”). They are good for safety, but you don't find out where you want to go by continually bumping into them.

In other words, TDD sets a minimum standard for design, but not a very high one.

Unit testing and code structure

The Kent Beck quote was interesting:

Tests should be coupled to the behavior of code and decoupled from the structure of code.

This sounds fine in principle, but in practice there is a delicate trade-off between unit testing code thoroughly and coupling tests to code structure. Let me explain.

Suppose a piece of code has several error paths which are difficult to exercise in integration tests. The usual way of unit testing such code is to isolate it and then inject errors using stubs or mocks in place of dependencies. This usually means isolating a relatively small piece of code so that the error outputs can be tested.

But often this means that the unit tests that drive the error paths depend on the structure of the code. If the code needs refactoring, those tests will need reworking too.

I haven't seen a good solution to this problem.

Deleting ineffective tests

I once worked on a project where the velocity went down considerably because of the cost of maintaining the test infrastructure. I tried arguing that many of the tests were unnecessary because they had never detected any problems. The counter-argument, which I couldn't refute, was that deleting some tests reduced the safety net and made it easier to introduce bugs.

I guess the one situation when tests can safely be deleted is where they duplicate other tests.

#SoftwareDevelopment #SoftwareTesting

Footnotes:

⁰ I think the IndieWeb should foster conversation and debate. We need to read each other's posts and respond to them rather than writing in a vacuum.

¹ When I use the term “design”, I'm really talking about the internal structure or architecture of code. I don't use the term as it's often used: to refer mainly to the user interface.

² “Simple made easy” — probably the best programming talk I've ever come across.

[ | about | blogroll | contact | now | search | subscribe | © | ]

Dot Social: Threads Has Entered the Fediverse

April 22, 2024

In a fascinating ninth episode of Dot Social, Mike McCue chatted to two Meta employees involved in the development of Threads: Rachel Lambert¹ (@rklambo@threads.net), Director of Product Management, and Peter Cottle (@pcottle@threads.net ), Software Engineer.

I would recommend listening to the full episode, but here are some highlights from my perspective.

What is Threads?

Rachel referred to Threads more than once as an app. I find this surprising. At the very least, Threads consists of a server, a (proprietary) protocol, as well as an app. (Later in the episode, she mentioned “pixel-level” compatibility, so perhaps she has experience in front-end development rather than servers and protocols.)

Fediverse integration is incomplete

Both the host and the guests referred to fediverse integration by Threads in the past tense. This feels like hype. They all acknowledged the integration was incomplete. I think it would be more helpful to refer to it as ongoing until it really is (almost) complete.

ActivityPub over AT protocol

Meta chose ActivityPub instead of AT protocol for federation because ActivityPub comes from a standards organisation and has been used (“battle-tested”) for ten years.

The value of federation

Meta sees federation as adding value to Threads and agreed with Mike's comparison to what happened when large companies started adopting web standards.

Milestones

Rachel outlined the major milestones in federating with the fediverse:

Milestone 1: (complete) Allowing Threads users to publish to the fediverse and receive feedback in the form of aggregated likes.

Milestone 2: Allowing Threads users see replies from the fediverse. This is the first step towards bidirectional support.

Milestone 3: Allowing Threads users to follow fediverse users and receive fediverse posts.

Hopefully, seeing Meta deliver these milestones will help to build confidence in Meta's intentions and alleviate concerns that Threads might suddenly drop federation support.

Federation foundational?

Mike referred to federation with the fediverse as being foundational to Threads. Peter reflected this back in terms of Meta's investment in federation support.

It wasn't clear to me that Threads is in any way being re-based on Activity pub, which is what I think would make federation foundational to Threads. It may be more accurate to say that federation is foundational to Threads' current marketing and positioning.

Fediforum

Peter's participation in the March 2024 Fediforum was mentioned. The following sessions mentioned Threads:

Threads in the fediverse: a quick demo, recorded on YouTube and PeerTube.
Politics of Threads joining the Fediverse and surveillance capitalism: summary and discussion notes (Note: I didn't see any threads.net handles in the list of participants.)
Threads as a gateway to mass adoption of open social web: brief summary and notes

Footnote:

¹ Unless I'm mistaken, Rachel is the first female participant on Dot Social. This is welcome. It would be great to include female engineers too.

[ | about | blogroll | contact | now | search | subscribe | © | ]

Blocking AI web crawlers

April 21, 2024

(It's not déjà vu: this post is an amalgamation of two earlier posts on this subject.)

AI companies crawl websites to train their large language models (LLMs). Aside from using copyrighted material that is not their's to copy, there is the huge environmental cost¹ in the training and use of LLMs.

So a growing number of owners of personal websites, including me, do not want to be party to this activity. We want to stop AI web crawlers from accessing out websites. But how can we do that?

The obvious first answer is to install a robots.txt file on the website. This lists the “user agents” and the URL paths that each agent is or is not allowed to access. Each web crawler has a user agent. For example, GPT uses GPTBot.

The rest of this post questions whether robots.txt is sufficient and suggests option for actually blocking AI web crawlers.

Is `robots.txt` sufficient?

Background

Three weeks ago I added a robots.txt to this site, intending to dissuade AI web crawlers from scanning my posts. I am grateful to Cory Dransfeldt for his post Go ahead and block AI web crawlers and for his subsequent creation of the ai.robots.txt git repository.

There was a reasonable response on Hacker News. Many wondered if AI web crawlers would respect robots.txt.

Will AI web crawlers respect `robots.txt`?

Perhaps not. But larger and/or reputable companies developing AI models probably wouldn't want to damage their reputation by ignoring robots.txt. Given the contentious nature of AI and the possibility of legislation limiting its development, companies developing AI models will probably want to be seen to be behaving ethically.

`robots.txt` and its limitations

According to Google's Introduction to robots.txt, the purpose of the file is to prevent a web site being overwhelmed by (search engine) crawler traffic and to prevent unimportant pages from being indexed. It won't guarantee that pages do not appear in a search index since a page may be indexed when it is referenced from another site. (Thus it seems Google doesn't check robots.txt before indexing such a page.)

What's the difference between search engine crawlers and AI crawlers?

AI crawlers won't necessarily follow links, whereas search engine crawlers typically will. The purpose of AI crawlers is to gather suitable training data for AI models, so they are likely to be more selective about the kinds of sites they crawl.

What about `ai.txt`?

The Spawning company is proposing an ai.txt file as a way of communicating the media types that a site does and does not offer to AI crawlers. At the time of writing, this file is Spawning-specific and would benefit from standardisation and adoption by the larger AI companies. (Another option would be to add support for AI crawling restrictions to the robots.txt format.)

So is it worth trying to block AI web crawlers?

I think so. If we don't make any attempt to block them, we're effectively inviting them in.

What to put in `robots.txt`?

There is a growing number of sources of suitable robots.txt files, including ai.robots.txt and Dark Visitors. All you need to do is arrange for your website to serve your chosen robots.txt file from the path /robots.txt.

Job done?

Well not quite. Unfortunately, respecting robots.txt requires effort on the part of the developers of web crawlers and not all of them may have implemented this support. (If you have evidence of AI web crawlers ignoring robots.txt, please get in touch! This is the closest I've found to a smoking gun.)

So if we want more certainty that AI web crawlers are not scraping our websites, we need to do better. Websites need to block AI web crawlers.

Implementing blocking

I've seen a number of approaches such as Apache or NGINX configurations which reject requests from certain user agents. Someone has used Cloudflare to block AI crawlers. There's even an NGINX module that rejects web crawlers which don't implement certain browser features properly.

Since my site uses NGINX, I wrote the nginx_robot_access module to enforce robots.txt. If you'd like to use it, see the instructions in the README. If you want to read about the development of the module, see #BlockingAI.

If you'd like to use the module, but are not currently using NGINX, you may be able to add NGINX as a reverse proxy (“in front of”) your existing web server — see the links in Abstracting cloud storage.

NGINX reverse proxy for web server

Seeing `nginx_robot_access` in action

If you want to see the module in action, issue the following:

curl -A "CCBot" https://underlap.org/banana.html

but please make sure to include banana.html so I don't mistake this for a smoking gun if I check the site's access log!

It should produce the following output:

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>

Comments

Henrik Jernevad wondered:

if crawlers/bots will not respect robots.txt, will they provide an honest user agent?

I think they will. Not to do so would be downright malicious or dishonest and that would detract from their reputation (see my earlier comments on AI web crawlers respecting robots.txt).

#BlockingAI #WriteFreely

Footnote:

¹ The carbon cost of training GPT-4 has been estimated at 6912 metric tonnes of CO₂ equivalent emissions. Even if this is an over-estimate, it shows the scale of the problem.

[ | about | blogroll | contact | now | search | subscribe | © | ]

NGINX robot access

April 16, 2024

The project I started earlier is public. The code is very rough, but it may be of interest to others and, who knows, someone may care to lend a hand. So what will it do?

nginx_robot_access is an NGINX module written in Rust that will enforce the rules in robots.txt. The aim is to provide some protection against AI web crawlers which choose to disregard those rules.

Dependencies

Firstly, I didn't really want to write an NGINX module in C (with which I'm much rustier than with Rust). I was pleased to find ngx-rust, a Rust binding to NGINX which comes with various example modules.

Secondly, I imagined having to parse robots.txt and implement an access control function based on its content. But after a brief search, I was delighted to find the Rust crate robotstxt. It does more than the crate I was thinking I'd have to write from scratch and it is understandable and properly tested. Kudos to its author, Shuang Zhu (a.k.a. Folyd on GitHub and Twitter), who ported Google's original C++ version to Rust.

What's there so far?

Very little actually. I started with the curl example from ngx-rust. I had difficulty building the example code against the published (v0.2.1) nginx-sys crate (provided by and needed in addition to ngx-rust).

However, the example built and ran just fine as part of ngx-rust. I got a clean build of the example code copied to my repository with a dependency on the current latest code in ngx-rust. Phew!

So to get a repeatable build (rather than depending on a specific hash, in case the history of the upstream repository is rewritten), I forked the repo and created a pre-0.5.0 tag to reflect the version that worked for me. Then I changed the dependency of my project to refer to the fork:

[dependencies]
ngx = { git = "https://github.com/glyn/ngx-rust",tag="pre-0.5.0"}
...

I logged an issue to upgrade this to dependencies on the ngx-rust and ngx-sys crates when later versions of these are published.

Postscript: the module now works, although it has few tests and could perhaps perform better.

An unexpected bug

I hard-coded the file path of robots.txt and then added some debug logging, including the following:

let mut matcher = DefaultMatcher::default();

ngx_log_debug_http!(request, "matching user agent {} and path {} against robots.txt contents: \n{}", ua, path, co.robots_txt_contents);

let allowed = matcher.one_agent_allowed_by_robots(&co.robots_txt_contents, ua, path);

if allowed {
    ngx_log_debug_http!(request, "robots.txt allowed");
    ...

The debug logs showed the correct string inputs to one_agent_allowed_by_robots:

2024/04/16 20:36:22 [debug] 2514#2514: *1 matching user agent curl/8.7.1 and path /index.html against robots.txt contents: 
User-agent: *
Disallow:

...

User-agent: curl
Disallow: /

User-agent: curl
Allow: /banana.html

...

2024/04/16 20:36:22 [debug] 2514#2514: *1 robots.txt allowed

but this is not the result I expected.

Running robotstxt from the command line with the above robots.txt reproduced the unexpected behaviour:

cargo run robots.txt curl/8.7.1 /index.html
     Running `target/debug/robotstxt robots.txt curl/8.7.1 /index.html`
user-agent 'curl/8.7.1' with URI '/index.html': ALLOWED

whereas dropping the version number from the user agent input produced the expected behaviour:

cargo run robots.txt curl /index.html
     Running `target/debug/robotstxt robots.txt curl /index.html`
user-agent 'curl' with URI '/index.html': DISALLOWED

I even tried adding a unit test for the code that strips the trailing part off the user agent:

assert_eq!("curl", Target::extract_user_agent("curl/8.7.1"));

and this passed. So it seemed as if extract_user_agent was not being called in the failing case.

That's exactly what was happening.

The issue was that robotstxt is a Rust port of code developed by Google — it's intended to be used by a web crawler rather than a web server. As such, it expects the user agent input parameter to be provided without a trailing slash and version.

Rather than try to change the scope of robotstxt, the solution I chose was to sanitise the user agent value by removing any trailing slash and version before passing the sanitised value to the robotstxt matcher.

(Being lazy, I copied the private extract_user_agent function and its unit tests from robotstxt. This is legitimate since both projects use the Apache v2.0 license.)

Postscript

Since writing this post I spent a few hours working out how to configure the module.

I based my module on the curl example provided with ngx-rust. This example had one configuration item, named curl which was used to enable or disable the module. The configuration item goes in the NGINX location configuration block:

http {
    server {
        ...
        location / {
            ...
            # libcurl module directive:
            curl on;
        }
        ...
    }
}

The start of the curl example module gave an important clue about the interface it is implementing:

struct Module;

impl http::HTTPModule for Module {

HTTPModule starts like this:

/// The `HTTPModule` trait provides the NGINX configuration stage interface.
///
/// These functions allocate structures, initialize them, and merge through
/// the configuration layers.
///
/// See https://nginx.org/en/docs/dev/development_guide.html#adding_new_modules
/// for details.
pub trait HTTPModule {
    ...

It turns out there are functions to initialise a module's configuration and then to merge the configuration at various levels.

The following diagram shows the placement of various kinds of configuration and the merging process:

NGINX configuration merging

The NGINX robot access module uses a configuration directive to specify the absolute file path of robots.txt. This directive may be specified in any of the http, server, or location configuration blocks. The merging process ensures that:

Configuring the directive in the location block overrides any configuration of the directive in the server block.
Configuring the directive in the server block overrides any configuration in the http block.

#BlockingAI

[ | about | blogroll | contact | now | search | subscribe | © | ]

Starting a new programming project

April 16, 2024

A couple of days ago, I had an idea for a useful utility and so today I thought I'd have a go at starting to code it. It felt like chiselling granite. Why should that be?

Regaining proficiency

During my programming career, I experienced a few periods when I didn't do a lot of coding. Doing standards work sometimes involved intensive work on a spec rather than coding. Working in a senior staff role sometimes involved planning, consultancy, and other non-coding activities.

Each time I came back to programming after a few months gap, I felt decidedly rusty and it took a week or so to become proficient again. This isn't helped by my relatively poor memory. (For example, when I worked in Java, I often had to look up the syntax for a main method.)

Motivating myself

There's some inertia in starting any new project. Have I understood the requirements sufficiently? Is the likely effort going to be too much? Will I run out of enthusiasm before getting something working?

I also want to investigate any alternatives to make sure I'm not reinventing something or overlooking a simpler solution.

An additional aspect is that doing a personal project doesn't carry with it the accountability and commitment of taking on some coding as part of a team.

Once I've made the project public, that is likely to change as I'll be making myself accountable to any observers and potential users or collaborators.

Even then, I need to discipline myself to schedule decent chunks of time and prioritise keeping chipping away at the project until I make reasonable progress.

Continual improvement

Over the years, I've picked up many good practices. It's hard to just hack something together, especially when I am planning on making my code public. In a sense, each project I work on has to be at least as good as earlier projects.

Project set-up

Even when I'm at peak coding proficiency, starting a new project is demanding as there are lots of design decisions and other choices to make.

This is especially the case for a project which seems small as even small projects require a certain irreducible amount of set-up. Also a small project often ends up being a bit more complex than I initially imagined.

Grinding through the set-up stage of a project can be tedious. There's often little to show for getting things off to a good start:

Starting a README
Setting up a git repository
Settling on a suitable license
Creating a skeletal set of code and build files — even when this is semi-automatic, choices need making
Learning any essential dependencies — this steps often takes longer than I would expect, especially when I encounter bugs or limitations in dependencies
Refamiliarising myself with the programming language chosen for the project.

Aiming for reuse

I don't find it particularly satisfying to simply throw together some code to get a job done. I want to structure the code into parts that could be reused and code which glues those parts together. If code is going to be reuseable, it needs to have a clean interface and some decent documentation.

This might sound like pure overhead for a personal project, but I find that separating the code in this way and clarifying and documenting interfaces helps me understand what I'm doing much better, so it's not something I particularly want to skip.

Conclusion

Writing this post has been a useful reminder. Producing a useful program is bound to be a non-trivial task and I need to bear that in mind.

There is an irreducible cost of starting any new project and there is also the cost of getting back into the groove of coding.

But I really enjoy programming and producing useful code, so I just need to keep going.

Postscript: see NGINX robot access for details of this new project or #BlockingAI for more on the subject.

#SoftwareDevelopment

About anonymous blogging

April 15, 2024

This is my response to Kev Quirk's post What About Anonymous Blogging?.

I completely understand why some people prefer to use a pseudonym when blogging. If they've experienced online abuse or have stuff going on in real life that they want to blog about, they might prefer the psychological safety of not blogging under their real name.

I've worked in open source and open standards for many years and am used to sticking my head above the parapet. I am fortunate not to have experienced anything that would put me off continuing to use my real name and I don't particularly feel the need to blog about my personal life.

One positive aspect of using my real name for blogging is that I feel more personal responsibility for anything that I post. I hold views that others may disagree with — who doesn't? — but I want to be responsible for the way I interact with others rather than potentially “hiding” behind a pseudonym.

As Kev points out, it's usually possible to find someone's real name if you really want to, so using a pseudonym might also give me a false sense of security.

Having said all that, I like using the “underlap” handle as that seems friendlier when interacting with others, e.g. on IRC. I am lucky to have a relatively unusual first name, so I grabbed “glyn” on GitHub, and I sometimes use that as a handle in other contexts too.

“I just wanna link the thing”

April 13, 2024

I came across this statement in a post by Erlend Sogge Heggen where he contrasts writing a review on Bookwrm or similar and just making a link available as a way of encapsulating a resource recommendation. The latter is the direction being taken by the federated website aggregator linkblocks.

Now it may be less work for a collector of links to simply add their link to a collection, perhaps with some tags. But what about the reader of the collection?

Doesn't the reader need to understand why someone is recommending a link? Probably.

Do they have to follow the link to decide whether the resource is worth checking out? I hope not.

Does a particular person recommending the link mean that the linked resource is necessarily worth spending time on? I doubt it.

So, although I love the concept of a federated website aggregator, I'd like there to be the option of adding commentary in the form of site reviews, comparisons between sites, overviews of collections, and such like.

It could be as simple as supporting the inclusion of some (for example, markdown or HTML) text per collection and per link. And clearly this is an extension of the basic function of an aggregator and doesn't detract from the aggregation function, so it should be easy to add as a feature once the basic support is done.

#WebsiteAggregation #LinkBlocks

Single user instances

April 10, 2024

I've recently read several posts where the author says they like being on a single user Mastodon instance.

They feel uninhibited about posting what they really think.

They own all their content.

They never disagree with moderation decisions, because they are the moderator. No moderator can censor them.

(The same may also be true for multi-user instances where the person in question is a sole/dominant moderator.)

There are some practical downsides of single user instances, such as: replies don't necessarily show up, “likes” can go AWOL, it's hard to discover new users to follow, and the trending feature doesn't work.

But a crucial downside from my perspective is that a single user instance is the ultimate echo-chamber.

No-one will correct me if I get out of hand. If I'm very lucky, someone will reply with constructive feedback or report me to the moderator (me!).

My domain name or VPS provider is unlikely to take action against me unless I do something particularly aggregious or illegal.

So an advantage of sticking with a multi-user instance, moderated by others, is that I benefit from corrective behaviour by others.

I also get to see more posts of people I don't follow, including those I might disagree with.

Do I prefer an echo-chamber that amplifies my biases? Or do I prefer reasonable debate in the public square?

(If I was in a persecuted minority, I might have little choice, but I'm not in such a minority.)

How about you?

Process algebra and JSONPath descendant segment semantics

April 5, 2024

One question the JSONPath Working Group had to wrestle with in specifying RFC 9535 was how much non-determinism to allow in the semantics of the descendant segment (..[<selectors>]). There was a spectrum of possibilities:

Specify the ordering fully (i.e. make the behaviour completely deterministic) — this would have been excessive given the unordered nature of JSON objects
Specify some determinism in the ordering — this was where we ended up
Specify no ordering at all (i.e. make the behaviour completely non-deterministic) — this would have made testing unnecessarily difficult.

For example, see this issue which increased the non-determinism of the behaviour.

In this post, I'll use some light-touch formal methods to describe the non-determinism more precisely than is possible with English alone.

Algebraic properties

It's helpful to use some algebra — process algebra — to describe properties of the non-deterministic processes we are considering.

Tony Hoare's Communicating Sequential Processes (CSP) can be used to model processes. The processes here do not exhibit internal divergence (unlimited internal “chatter”), so we need only be concerned with the traces and failures of processes.

Traces are sequences of external events — in our case the value input to a query followed by the nodes of the nodelist resulting from the query.

Failures are cases where the process terminates (or, equivalently, deadlocks) after producing a finite trace.

Traces and failures together can be used to define a refinement relation (⊑) between CSP processes.

With that thumbnail sketch of the theory out of the way, we can describe the refinement relationships between the processes in question.

Let's start with processes corresponding to the spectrum of non-deterministic semantics of ..[<selectors>] (for a given list of selectors, such as *):

S₀ is where the ordering is specified fully
S₁ is where some determinism in the ordering is specified — specifically, the semantics of RFC 9535
S₂ is where no ordering at all is specified.

Then:

S₀ ⊑ S₁ ⊑ S₂

Furthermore, any valid implementation, I, of the RFC is a refinement of the spec, S₁:

I ⊑ S₁

Note that the Haskell simulation, Sim, of the semantics of ..[*] is equivalent to (refines and is refined by) the spec:

Sim ⊑ S₁
S₁ ⊑ Sim

So far, so good. One wrinkle is that RFC 9535 defines a non-deterministic node-and-descendants traversal order, T, but the result of ..[*], S₁, is a strict refinement (⊏) of that traversal order:

S₁ ⊏ T

Another way of putting this is that S₁ is a refinement of T (S₁ ⊑ T), but T is not a refinement of S₁ (not T ⊑ S₁). Therefore T and S₁ are not equivalent processes.

In English, ..[*] is strictly less non-deterministic than the node-and-descendants traversal order defined by the RFC. A couple of examples will make that clearer.

A deterministic example

Let's consider $..[*] applied to [{"j": 4}, {"k": 6}]. The tree structure of this document is:

(Please don't ask me why trees in computing are often upside-down.)

The spec requires that array elements are visited in order so {"j": 4} must be visited before {"k": 6}.

The spec also requires that parents are visited before their children, so: * [{"j": 4}, {"k": 6}] is visited before {"j": 4} and {"k": 6} * {"j": 4} is visited before 4 * {"k": 6} is visited before 6

So one valid node-plus-descendants traversal order is the following:

D1: [{"j": 4}, {"k": 6}]
D2: {"j": 4}
D3: 4
D4: {"k": 6}
D5: 6

(See Section 2.5.2.2 of the RFC for the meaning of Dn, used above, and Rn, used below.)

The child segment [*] needs to be applied to all the visited nodes. Let's see what happens to each node (regardless of the order in which the nodes are visited!) when [*] is applied:

[{"j": 4}, {"k": 6}] produces the nodelist R1: {"j": 4}, {"k": 6} (To avoid confusion, I'm not writing this as a JSON array.)
{"j": 4} produces the nodelist R2: 4
4 produces the empty nodelist R3
{"k": 6} produces the nodelist R4: 6
6 produces the empty nodelist R5

So the resultant nodelist is the concatenation of Ri (i=1..5): {"j": 4}, {"k": 6}, 4, 6.

It turns out that for any valid traversal order of the descendants, $..[*] applied to [{"j": 4}, {"k": 6}] produces the nodelist {"j": 4}, {"k": 6}, 4, 6. This is because the valid traversal orders only change the positions of 4 and 6 and these don't affect the outcome since they each produce the empty nodelist.

Non-deterministic example

Let's consider ..[*] applied to a carefully chosen nested collection of arrays [[[1]],[2]]. The tree structure of this document is:

The spec requires that array elements are visited in order so [[1]] must be visited before [2].

The spec also requires that parents are visited before their children, so: * [[[1]],[2]] is visited before [[1]] and [2] * [[1]] is visited before [1] * [1] is visited before 1 * [2] is visited before 2

So there are six valid node-plus-descendants traversal orders: * [[[1]],[2]], [[1]], [1], 1, [2], 2 * [[[1]],[2]], [[1]], [1], [2], 1, 2 * [[[1]],[2]], [[1]], [1], [2], 2, 1 * [[[1]],[2]], [[1]], [2], [1], 1, 2 * [[[1]],[2]], [[1]], [2], [1], 2, 1 * [[[1]],[2]], [[1]], [2], 2, [1], 1

Applying the child segment [*] to each of these orders produces the following two possible resultant nodelists: * [[1]], [2], [1], 1, 2 * [[1]], [2], [1], 2, 1

Conclusion

Both the above examples demonstrate the non-determinism of the traversal ordering and the lesser non-determinism of the descendant segment. Applying the child selector(s) to the traversed nodes reduces the non-determinism and makes the spec easier to test against.

This post also hopefully demonstrates that applying formal methods with a light touch can add precision to descriptions of software semantics.

#IETF #SoftwareStandards #ProcessAlgebra #FormalMethods

What makes a good design document?

Topics to cover

Don't design documents go out of date?

Example design documents

A note on Z specifications

1. Diagrams: Eclipse Virgo concepts

2. Apache Tomcat 9 Architecture

3. Image Relocation: README

4. Image Relocation: formal specifications

5. UML: jvmkill Developers' Guide

6. An Abstract Model of Linux Control Groups and Subsystems

Writing tests to drive design (TDD)

Unit testing and code structure

Deleting ineffective tests

What is Threads?

Fediverse integration is incomplete

ActivityPub over AT protocol

The value of federation

Milestones

Federation foundational?

Fediforum

Is robots.txt sufficient?

Background

Will AI web crawlers respect robots.txt?

robots.txt and its limitations

What's the difference between search engine crawlers and AI crawlers?

What about ai.txt?

So is it worth trying to block AI web crawlers?

What to put in robots.txt?

Job done?

Implementing blocking

Seeing nginx_robot_access in action

Further reading

Comments

Dependencies

What's there so far?

An unexpected bug

Postscript

Regaining proficiency

Motivating myself

Continual improvement

Project set-up

Aiming for reuse

Conclusion

Algebraic properties

A deterministic example

Non-deterministic example

Conclusion

Is `robots.txt` sufficient?

Will AI web crawlers respect `robots.txt`?

`robots.txt` and its limitations

What about `ai.txt`?

What to put in `robots.txt`?

Seeing `nginx_robot_access` in action