A while back I wrote a piece asking how we begin to think about using data to move forward with standardization, and called for ways to help get data. One thing I did was request a new query from the HTTPArchive including data on “dasherized elements”. Keep in mind that while the top 1.2 million sites or so in this dataset is a lot of data, it is still a small sample and has its own biases. It reports mostly on a particular ‘kind’ of site, which is not representative of the giant bottom of the iceberg that lives beneath the surface: inside corporate intranets, behind logins and paywalls, and so on. Ultimately, we need more, but you have to start somewhere.
Yesterday, Simon Pieters answered with this tweet linking to an HTTPArchive post and yielding this dataset which is amazing.
It’s still a little hard to track because we can’t tell whether that is one page that includes an element a bunch of times, or many pages that include them, but this is an awesome start!
That dataset is a little hard to view. The attributes are awesome in helping us know more about what each element is, but they also add some noise and leave the counts slightly confused, so I ran it through some processing and created a few other views (linked where appropriate below)…
Here are some preliminary, interesting observations:
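To give a feel for the kind of processing involved, here is a minimal sketch that collapses attribute-bearing rows into per-tag totals. The row shape (an opening tag string plus a count), the `collapse_by_tag` helper, the simplified name pattern, and the sample rows are all assumptions made for illustration; they are not the actual HTTPArchive output or the script I ran.

```python
import re
from collections import Counter

# Simplified pattern for a dasherized tag name at the start of an opening tag.
# Real custom element names allow more characters; this is an approximation.
TAG_NAME = re.compile(r"<\s*([a-z][a-z0-9]*(?:-[a-z0-9]+)+)", re.IGNORECASE)


def collapse_by_tag(rows):
    """Sum counts per tag name, ignoring whatever attributes each row carried."""
    totals = Counter()
    for markup, count in rows:
        match = TAG_NAME.match(markup)
        if match:  # rows without a dasherized name are skipped
            totals[match.group(1).lower()] += count
    return totals


# Hypothetical rows: (opening tag as it appeared, number of occurrences).
rows = [
    ('<amp-ad type="a">', 3),
    ('<amp-ad type="b">', 2),
    ("<my-widget>", 1),
    ('<div class="x">', 9),
]
totals = collapse_by_tag(rows)
print(totals.most_common())
```

Collapsing this way loses the attribute detail, so it is probably worth keeping the raw rows around alongside the summed view.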
The HTTPArchive query that reports on use of HTML elements searches for only 140 known, specific elements that are in a standard, but even in this small sample, this report shows over 24k different “dasherized” tags that appear in the top 1.2 million pages. Wow! What this tells me is that there are a lot of dasherized tags in use.
It’s important to note that this doesn’t mean these are “custom elements” proper, but it also doesn’t really matter: what we really care about is what you were trying to say there, semantically.
Of these, there are 3,227 unique prefixes. These may or may not indicate common authors, but they might at least be a helpful way to look for popular ‘sets’ of elements. For example, it’s unsurprising to see the amp- prefix in there given all of the boosts that it gets, and it’s nice to see them all linked in and counted there. I’ve also organized these into a JSON output.
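As a rough illustration of grouping by prefix, the sketch below treats everything before the first dash as the prefix and nests tag counts under it. The `group_by_prefix` helper and the resulting JSON shape are assumptions for illustration, not necessarily the structure of the real file; the two amp- counts are the ones mentioned in this post, and `my-widget` is a made-up entry.

```python
import json
from collections import defaultdict


def group_by_prefix(tag_counts):
    """Nest {tag: count} under the text before the tag's first dash."""
    groups = defaultdict(dict)
    for tag, count in tag_counts.items():
        prefix = tag.split("-", 1)[0]
        groups[prefix][tag] = count
    return dict(groups)


sample = {
    "amp-auto-ads": 3718,
    "amp-ad": 395,
    "my-widget": 1,  # hypothetical entry, not from the dataset
}
grouped = group_by_prefix(sample)
print(json.dumps(grouped, indent=2))
```

A shared prefix is only a hint of a shared author, so a grouping like this is a starting point for looking at ‘sets’, not proof of where an element came from.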
To break them down into some further semi-arbitrary groups for summary:
- ~7.8k of these occur between 1 and 100 times
- 31 of these occur between 101 and 200 times
- 14 occur between 201 and 500 times
- 4 occur between 501 and 1000 times
- 4 occur more than 1000 times
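A breakdown like the one above can be sketched as a simple bucketing pass over per-tag totals. This assumes non-overlapping buckets with inclusive upper bounds and a final open-ended bucket above 1000; the `bucket_label` and `summarize` helpers are hypothetical names, and apart from the two amp- counts quoted in this post, the sample values are invented.

```python
from collections import Counter

# (inclusive upper bound, label); anything larger falls in the last bucket.
BUCKETS = [
    (100, "1-100"),
    (200, "101-200"),
    (500, "201-500"),
    (1000, "501-1000"),
]


def bucket_label(count):
    """Return the label of the first bucket whose upper bound fits the count."""
    for upper, label in BUCKETS:
        if count <= upper:
            return label
    return "> 1000"


def summarize(tag_counts):
    """Count how many distinct tags land in each bucket."""
    return Counter(bucket_label(c) for c in tag_counts.values())


sample = {
    "amp-auto-ads": 3718,
    "amp-ad": 395,
    "x-foo": 150,    # hypothetical
    "my-widget": 1,  # hypothetical
}
summary = summarize(sample)
print(summary)
```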
One personal note: I’m kind of sad to see that the most popular one is amp-auto-ads, occurring a whopping 3718 times, and it’s not remotely the only thing that would appear to be about ads. In fact, amp-ad also occurs 395 times and there are many other non-amp elements that appear to be ad related. But... I guess the web has a lot of ads. Who knew.
More importantly, it’s interesting to look at this file (or the grouped one) from the bottom up and think about whether we can identify the possible sources of these, or ‘tag’ them according to common purposes somehow. I’d like to think about how we could get this into a format that makes that easier. If you’re potentially interested in digging in and helping think about this (identifying where some of those come from, what their purpose is, getting that data into a place where we can do that kind of stuff better, whatever), feel free to leave comments on any of these gists or cc me (@briankardell) on Twitter.