I collect news articles and blog posts, archive them, flatten them down to text and scan for tags, and index them in wikis. One trend that I’ve been noticing in the last year or so (and cursing) is the shift of page content out of HTML and into JSON that’s assembled by scripts, sometimes located on other sites. I’ve coped with that by passing the page to Python3 code that lets the page build itself and then scrapes out the content. It’s not perfect, and has the weakness that if those scripts go away, it will stop working. But at least the content is in there in the JSON.
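For pages that still ship their content as JSON inside the markup, a much simpler fallback than letting the page build itself is to pull the JSON straight out of the `<script>` tags. The following is only a sketch of that idea, not the code described above: it assumes the page embeds its data in `application/json` or `application/ld+json` script blocks, and the class and function names are made up for illustration.

```python
import json
from html.parser import HTMLParser


class EmbeddedJSONExtractor(HTMLParser):
    """Collect the parsed contents of JSON-bearing <script> tags."""

    # Assumption: the site uses one of these MIME types for its data blocks.
    JSON_TYPES = {"application/json", "application/ld+json"}

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            script_type = (dict(attrs).get("type") or "").lower()
            if script_type in self.JSON_TYPES:
                self._in_json_script = True
                self._buffer = []

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so buffer them.
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_embedded_json(html_text):
    """Return a list of every JSON document embedded in the page."""
    parser = EmbeddedJSONExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.documents
```

This uses only the standard library and never runs the page's scripts, so it keeps working if those scripts disappear, but it finds nothing on a page that fetches its data separately.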
Today I hit a page that carries this trend to its end: A page with zero HTML content and no JSON data.
<META NAME="robots" CONTENT="noindex,nofollow">
And how do I archive that? I could probably write a custom routine to fetch that resource, but it’ll probably be brittle code that will break every time the script changes, and I don’t feel like writing custom code for every site out there.
If during the last thousand seconds you have received any High-Beyond-protocol packets from “Arbitration Arts,” discard them at once. If they have been processed, then the processing site and all locally netted sites must be physically destroyed at once. We realize that this means the destruction of solar systems, but consider the alternative. You are under Transcendent attack.
– A Fire Upon the Deep
Well, HTML won’t ever die until large-scale email programs (looking at you, Outlook) read other coding languages properly, or at all. Or until we migrate to completely different electronic communication media.
But the trend you are pointing out is disconcerting.
It looks like they have your number
(edit: not calling you a robot, just saying it looks like the un-archivability of this page seems like it may be an intended feature)
Not while I’m around (and running my solo development shop in a non-tech environment, hopefully hanging on to my outdated ways until “retirement”)!
Ah, but my program is the latest version of Firefox.
That might be a mistake in a recent change. I don’t see how that won’t mess up the site for search engines.
You’re claiming HTML died, but you have HTML tags in the thing you’re quoting? That seems to contradict your claim.
I have the feeling that this is heading towards some kind of content control. Eventually those scripts will be asking “Have you paid yer dues?” before assembling the output.
P.S. It occurred to me that pages will always have to be search engine readable if they want any ranking, listed on Google News, etc, so I checked out the AMP version of news articles. Same deal there.
I suspect that this will be another one for the “Stallman’s position was dismissed as absurd neckbeard paranoia about something that wouldn’t happen until it happened hard enough to be considered commonplace” files.
Those are more crowded than one would like…
RMS is the Walter Sobchak of coding. He’s not wrong, just an asshole.
We’ll see. AMP is already deferring a lot of the DOM generation to scripts, where the HTML just defines the arguments passed to the script, and Apple News takes that a step further by wrapping the news articles as a JSON object.
I’m not actually against this, though. One of the issues with a long-running site like BB is what do we do with old posts? Many have old, antiquated HTML formatting that won’t work with new designs (this, BTW, is the very reason that the BB display format has been unchanged for so long). Avoiding that by creating a text “content block” and leaving the rest to dynamic generation makes a lot of sense from that perspective.
Where it gets complicated from an archival perspective is that a “content block” isn’t universal, so every site has a different idea of what it is. Paradoxically, that’s also what the dominance of large platforms helps alleviate: sites become easier to archive if they all use the same underlying technology, at the cost of platform diversity.
So, I’m not sure what the answer is. For BB, we will be performing a large and wide-reaching rearchitecting at some point (hopefully sooner rather than later), with a focus on “just the texts, ma’am”, and feeding that block of content into multiple renderers, including vanilla HTML. But I no longer think of our content in terms of how it looks in HTML, and certainly don’t want to use any of its available formatting elements any longer.
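To make the “content block fed into multiple renderers” idea concrete, here is a rough sketch. It is not BB’s actual design: `ContentBlock`, `render_html`, and `render_plain_text` are hypothetical names, and a real block would carry far more than a title and paragraphs.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class ContentBlock:
    """Presentation-free content: just the text, no markup."""
    title: str
    paragraphs: list = field(default_factory=list)


def render_html(block):
    """Render the block as vanilla HTML, escaping the text."""
    parts = [f"<h1>{escape(block.title)}</h1>"]
    parts += [f"<p>{escape(p)}</p>" for p in block.paragraphs]
    return "\n".join(parts)


def render_plain_text(block):
    """Render the same block as plain text for archiving or indexing."""
    return "\n\n".join([block.title, *block.paragraphs])
```

The stored post never contains formatting, so a redesign only touches the renderers, and an archiver can ask for the plain-text form directly.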
Yeah, I think what’s been happening is that we’ve been reinventing the application platform that’s been around since before the Internet. It’s just that no one wanted to actually implement the part of the OSI model of the network stack where the session and presentation layers would be present. It’s unfortunate, too, since it would be nice to have protocols for those rather than bolting more crap onto CSS and JS. And I say this as a former desktop developer: it’s just funny seeing JS frameworks for UI that are almost copies of WPF/XAML and other GUI toolkits. I’m just waiting for someone to copy Tcl/Tk for the web, or even wxWidgets.
I hate that guy. My initials are RMS too.
I’d argue that his track record for predictions puts him solidly above many of the ‘futurists’ who dabble in IT predictions (not merely ‘not wrong’); it’s just that nobody is happy when his come true, and it certainly isn’t clear that he’s the public face the countermovement needs (though, in fairness, the sorts of people who sign up for endless uphill battles that most people don’t even recognize the need for almost certainly tend to be an odd bunch, so you may or may not have much choice in the ‘pick one that’s also diplomatic’ area).
There are really only two major ones I can think of that appear to have blindsided him: tivoization’s use of cryptographic tools to render software freedom irrelevant on so many devices people can actually buy; and ‘cloud’/‘SaaS’ making a model where you can sell software without ever distributing it in a way that makes the GPL kick in.
Tivoization came somewhat out of left field (I’m not sure who first implemented a crypto bootloader; I’m guessing it’s either “obscure fed MLS system nobody heard of”, “IBM, but only on mainframes that lease for more than your house” or “cable/cellular Telco conditional access implementation”, but I would be interested to know more specifically), and the fact that it’s rapidly approaching (if it hasn’t already become) the default rather than the exception was far from obvious in the past.