Making Sure Content Lives On...

In which I describe a problem, share a possible solution that you can use today, and, hopefully start a conversation...

(If you already know the problem, and you're looking for the solution, check out the documentation over on GitHub)

There's a series of problems that I experience literally all the time: Link rot and the disappearance of our history. I have written posts on blogging platforms that no longer exist. I have written guest posts for sites that no longer exist. I frequently wish that that content still existed - not only still existed, but was findable again. I have researched hard and bookmarked or linked to obscure but important history which, you guessed it: no longer exists. This is tremendously frustrating, but also just really sad (more on this later)...

This is one thing that really interests me about a potentially future decentralized web - we could do better here. At the same time, we often have to deal with the Web we have now - not the one we wish existed. So, let's look the state of this today?

Archiving today

Luckily, when content disappears, I can often turn to the Internet Archive and the Wayback Machine to find content. But the truth is that they've got a very tough challenge. Identifing all the things that need indexing is really hard. If you have a relatively small blog or site, especially, it can be hard to find. Then, even if it gets found, how do they know when you've got some new content? So, the net result is that no matter how good a job they do - a lot of really interesting stuff can wind up not being findable today there either.

History itself makes this problem even harder: It's very unpredictable. Often things are considerably more interesting in only in hindsight. Things also tend to seem a lot more stable in the short term than they are in reality in the long arc of history: Even giant "too big to fail", semi-monopolies ultimately shift, burn out or even just break their history. One of the wisest things I ever heard was an older engineer who told me if they could impart one things to students it was this - all the paradigms that you think are forever will shift. Once upon a time IBM ruled the world. Motorola was absolutely dominant. Once upon a time, SourceForge was the shit. Then GitHub. Once upon a time Netscape won the interwebs, and then they lost it to Microsoft who won it even bigger. Then Microsoft lost it again.

It's ok if you don't share my desire, but I personally, want to be sure my content continues to exist, and people can find it via old links despite all of these uncertain futures. Today, the best chance of that is in the Internet Archive. So, how can I be more confident that it does?

Maybe... Just ask?

Interestingly, a lot of these problems get hella simpler if you just ask.

For a really long time now (as long as I can remember the Wayback Machine existing), you can go to https://archive.org/web/, enter a URL under the heading 'Save page now' and click the button.

A notification model is way, way simpler than somehow trying to monitor and guess - and we use this same basic pattern for several things in engineering. It's quite success at keeping things efficient.

The trouble here is, I think, that it is currently manual. It's unlikely that I'm going to do that every time I add a new blog post, even thought that's exactly what I want to do.

Maybe we can fix that?

Asking by default...

Recently, this got me to thinking... What if it were really easy to incorporate this into whatever blog software you use today? Think about it: RSS continues to thrive, in part, because you hardly have to know about it - it's so comparatively easy (entirely free in some cases in that it is just part of the software you get). If you could incorporate this into your system in somewhere between 0 and 30 minutes, would you?

Every now and then I have a thought like this that seems so clear to me that I figure "surely this must exist? Why do I not know about it?" So, I asked around on social media. Surprisingly, I could find no one who was aware of such a thing. What about publicly exposing an API? Do they? Maybe? Kinda? Could we make it easier?

Brendan Eich cc'ed Jason Scott from the Internet Archive into the conversation. Jason cc'ed in Mark Graham, the Director for the Wayback Machine. Mark got me access to a service and set up this little experiement to try it out.

But how to get there?

One of the challenges here is how to make something 'easy' when there are so many blogging systems and site hosting things. Submitting a URL automatically, even, is technically not hard - but fitting that into each system is a little more work. Then, each of these uses different frameworks, languages, package managers and so on. I don't really have time to build out 100 different 'easy' solutions just to start a conversation. In fact, I want this for myself, and currently my setup is very custom - it's all stuff I wrote that works for me but would be more or less useful to someone else.

But then I had a thought: There's actually lots of good stuff in the Web we have for tackling this problem.

Existing, higher level hooks

HTTP is already a nice, high-level API that doesn't really care about how you built your site, but it might not be automatically obvious or convenient regarding how to plug into it. We already have lots of infrastructure too... As I said, almost everyone has an RSS feed and almost everyone has some kind of 'publish' process. Maybe I could make it easy for huge swaths of people by just tapping into some big patterns and existing machinery. What I really wanted, I thought, was just a flexible and easy to integrate service. It doesn't have to block your build, it isn't absolutely critical, it's just "hey... I just published a thing."

So, I decided to try this..

I created an HTTP service with a Cloudflare worker. Cloudflare workers are interesting because they're deployed at the edge, they sleep when youre not using them and you can process up to 100k requests per-day for free. Since they kind of spin up as needed, handling errors for this kind of thing is easy too - you aren't going to pollute the system somehow. In fact, they seem kind of ideally suited for this kind of thing.

This seems very good actually because blogs, especially personal blogs, tend to churn out content "slowly". That is, 100k requests per day seems like plenty for a few orders of magnitude more blogs than that. So... Cool.

Basically - you send an HTTP Post to this service with the content-type `application/json` and provide some stuff in the body. But what you provide can differ to make integration very easy regardless of what kind of setup you have. It allows, effectively 3 'shapes' that I've put together that should fit nicely and easily into whatever system you already have today and be easy to integrate. For my own blog, hosted on gh pages, it took ~5 minutes. If you're interested in trying it out or seeing what I came up with check out the experiment itself, try it out.