A web tool that converts PDF scientific papers into HTML

Originally published at: A web tool that converts PDF scientific papers into HTML | Boing Boing

I wonder how it will handle all the “COPYRIGHT NOTICE: DOWNLOADED ON [date] FROM [institution] – LICENSE #12345” floaty bits that the vendor PDF generator helpfully supplies every time you get an article through an institutional subscription.

So someone made a web site that does what a couple of open-source Linux command line utilities do. That’s…great.

https://manpages.ubuntu.com/manpages/trusty/man1/pdftohtml.1.html

Edit: in a rush to prove my superior clevertude, I neglected to read the article in sufficient detail, and now recognize the superficiality of my response.

This is an AI-based tool that appears to wikify PDFs.

It’s a lot nicer than just copying a PDF into HTML.

PDFtoHTML is just a printer. It makes what amounts to a visual copy of a PDF in HTML.

That doesn’t really make the PDF more useful.

Well, next time, I guess I should RTFA! :smiley:

Or from Sci-Hub.

I believe that if the PDF is already WCAG-compliant, this should be really simple.

Yeah, but it rendered the Abstract given as ‘The user has requested enhancement of the document.’
Maybe that was what you pasted in there to enhance it?

After loading up some academic papers in the humanities, I was sorely disappointed. Tried a physics paper. (Phys. Rev. Lett. 127, 081602 (2021) - Quantum Mechanics of Gravitational Waves)

And no, I don’t know what the hell the paper is actually talking about. PBS Spacetime “Journal Club” regularly goes over my head.

The irony here is that all of those papers were written in LaTeX, still the only serious tool of choice for academic papers. The LaTeX is rendered out to PDF for easy emailing to journals, peers, etc. LaTeX is what HTML hoped to be when it grew up, before the weird CSS cousins moved in and trashed the house, ruining everyone’s dreams.

If instead we could get the original LaTeX code and make that more available, it would be much easier to make clean and readable versions in HTML, Markdown, or your other markup/procedural-formatting tool of choice.

But I guess that’s a nirvana fallacy, and if converting the PDFs back to something like HTML helps people, more power to 'em. Even if it does amount to playing a CD over the radio, then recording that broadcast back to a cassette tape to give to someone.
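
For anyone curious what the source-first route might look like, here is a minimal sketch. It assumes the original `.tex` file is actually available and that pandoc is installed; the function names and the `paper.tex` file name are hypothetical, chosen only for illustration. Papers that lean on custom packages or macros may well not convert cleanly.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(tex_path, target="html"):
    """Build a pandoc command line converting a LaTeX file to `target`.

    Hypothetical helper: only "html" and "markdown" are handled here.
    """
    extensions = {"html": ".html", "markdown": ".md"}
    if target not in extensions:
        raise ValueError(f"unsupported target: {target}")

    src = Path(tex_path)
    out = src.with_suffix(extensions[target])  # paper.tex -> paper.html

    cmd = [
        "pandoc", str(src),
        "--from", "latex",
        "--to", target,
        "--standalone",          # emit a complete document, not a fragment
        "--output", str(out),
    ]
    if target == "html":
        cmd.append("--mathjax")  # render equations with MathJax in the browser
    return cmd


def convert(tex_path, target="html"):
    """Run the conversion, failing early if pandoc is not on the PATH."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on the PATH")
    subprocess.run(pandoc_command(tex_path, target), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so this works without pandoc.
    print(" ".join(pandoc_command("paper.tex")))
```

Building the command separately from running it makes it easy to inspect what would be executed before pointing it at a real paper.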

[quote=“VeronicaConnor, post:10, topic:205236”]

The TeX FAQ seems noncommittal.

Plus, compiling other people’s TeX/LaTeX is a hit-or-miss affair because of missing dependencies. A DVI is sometimes more useful, but pdfTeX doesn’t produce DVI by default.

I didn’t say it was easy or practical; quite the opposite (hence my calling myself out on the nirvana fallacy).

I just wanted to point out the amusing irony in going from procedural markup to PDF and back to a different procedural markup. It’s rather like printing out a text file, then scanning and OCRing it back into a different document format. Perhaps reasonable if the original text file is unavailable or impractical to use, but it’s funny to me nonetheless.

There are still quite a few scientific journals, especially in the life sciences, that accept submissions in Word. I had an irritating experience writing a paper with a zoologist (for an ethology journal) where he wrote his parts in Word, I wrote mine in LaTeX, for amalgamation I painstakingly converted his part to LaTeX, and then when the paper was accepted the journal insisted that we provide the source in Word.

Still? I’d be hard pressed to find any journal in archaeology that accepts anything other than Word. Which is fine, because I have absolutely no use case for LaTeX myself. LaTeX was first and foremost an attempt to introduce proper typography in a time when word processors couldn’t do that as a matter of course. But journals impose their own style on the text anyway, so I don’t see why LaTeX should be used over any other markup scheme.

As for typography: I much prefer setting my own documents in DTP software rather than relying on the rigid template approach of LaTeX. It might be good if you use equations, but that isn’t a concern for me.

TL;DR: what’s with the condescension? Other fields have other requirements.

Me? I was just responding to another poster’s suggestion to the contrary.

I felt you were rather condescending towards your co-author’s use of Word. If I misread that I apologise.

Not at all. He used Word for his part because it was relatively low on equations, and it is what he normally uses in his field. My part of the paper was full of equations, some rather complicated, so we mutually thought it made more sense to convert it all to LaTeX, as that conversion was straightforward. The ultimate conversion in the other direction was a nightmare, despite the fact that I’m comfortable with Word.

I wrote my PhD thesis in WordStar, with printer escape sequences for the special characters. It felt a little like that.
