Originally published at: How to make a DIY book scanner - Boing Boing
…
Depending on whether you are OK with having a book as .jpgs or would rather have a real text file, a pretty good OCR app for Android is Text Fairy. It is fast, free, works offline, is fairly accurate in most cases, and runs well even on older phones and tablets.
I shot a copy of boy wonder for a friend with my phone and OCRed it right on the phone. Took some time, but it worked.
I used this method about 8 years ago to scan my mom’s (typewritten) dissertation. Then used some free OCR software and added in bookmarks. Worked fantastically.
I scanned in all my old science fiction paperbacks about a year ago. People nod and smile until I tell them how I did it, then they break out the torches and pitchforks.
I used a paper cutter to slice off the spine of the book, then I fed the now-loose pages into a scanner with a document feeder. I used a Brother ADS-2700W. It could scan both sides of the page at once.
I got the process down to where I could do 20-30 books a day.
Took me 9 months. A tad over 2000 books. Roughly half a million pages.
Yes, the book was destroyed. But these were mass-market paperbacks. Not archival quality by any means. Not acid free paper. Some were already starting to crumble.
I haven’t OCR’d them yet, but I probably will eventually. They’re all saved as CBZ files.
Work area:
The problem is software. I actually have a dedicated CZUR book scanner, but I rarely use it because the software is so clunky as to be almost unusable, and there is no open source or even reasonably cheap commercial software that can take a number of double-page images and separate them into single pages with some smoothing of curves and other corrections.
It seems this would be a perfect application for machine learning, so I’m surprised there isn’t anything out there that bills itself as “AI powered”.
The software discussion on dewarping using depth information linked in the OP is fascinating, but doesn’t lead to any practical solution for existing hardware.
Only mostly kidding…
this is fascinating, brutal, but very practical. kinda like the finality of it.
Here, have some irony:
(A subplot of the book involves a library being digitized by throwing all the books into a shredder, photographing all the shreds as they’re being blown out the other end, and then digitally reassembling the book from the photographed shreds.)
I have a ScanSnap high speed 2-sided scanner for various loose documents and an Epson flatbed scanner and inkjet printer for books and various items that can’t be fed through the ScanSnap. If I have something I want to keep a copy of when I’m reading magazines in the library, I use a mu-43 camera to shoot the pages of interest. I can adjust exposure to make the page white and save the image as a RAW file to do any further corrections at home.
Holy shit dude. That’s dedication.
Yeah, pour one out for the books, but those cheap paperbacks are going to slowly eat themselves away as it is. (But they smell wonderful!) Now they are archived for others.
Did you make a repository for the files somewhere? Did you have any Shadow books? (Not sci-fi, but worth a shot.)
I would have had a book scanner more than a decade ago if Daniel Reetz hadn’t run out of some components (I forget which). I didn’t mind the delay, but ultimately he graciously offered me a full refund, which left me without a scanner.
One of these days I’ll get the parts CNC routed from some nice Baltic birch (I’ve even now got a buddy who can do it), so I can scan my complete collection of Racecar Engineering from v1n1 to about 2005, and my (duplicate) collection of Cinefex 127-172 properly.
I keep meaning to upload them to archive.org, but haven’t yet.
Let them worry about the copyright stuff…
Whatever you do don’t shell out for the high-end HP large-bed scanner. I work with a non-profit that did, & the proprietary HP software is the worst. Hard to learn, not intuitive, & crashes frequently which requires re-booting the whole system every time. (Boot time is over 5 minutes, always.)
Well they certainly would be a good place to do it! For some of the older OOP stuff, there has to be some other sci-fi groups and sites that would also be a good place.
You probably already tried it, but for me, unpaper worked very well (I still needed to fix a few exceptions, but most of the job ran smoothly after a little tweaking on a smaller set).
I have tried it on both a book that someone else on the internet scanned and some I scanned myself, and I think my own scans were more regular than the one from the internet.
It also improved the images a bit and made the OCR process more accurate.
For scans that are mostly non-text but well scanned, ImageMagick was able to do it.
I haven’t actually used unpaper before, so thanks for the link! It sounds really powerful but doesn’t seem to provide the function I’m most looking for: to separate the two facing pages captured in one image into distinct images/pages of a PDF (and ideally compensating for the curve that is inevitable when scanning a book without a book cradle).
Imagemagick I do use a lot, mostly to make PDFs or GIFs out of image series. It truly is magick
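For anyone who would rather stay in one scripting language, the image-series-to-PDF step can also be sketched with Python's Pillow library instead of ImageMagick. This is only a rough sketch: the function name `images_to_pdf`, the `*.png` pattern, and the assumption that sorted filenames give the page order are all mine, not anything from the tools mentioned above.

```python
from pathlib import Path
from PIL import Image

def images_to_pdf(src_dir, out_pdf, pattern="*.png"):
    """Bundle every image matching `pattern` in src_dir into one PDF.

    Pages are ordered by sorted filename (an assumption about naming).
    Returns the number of pages written.
    """
    paths = sorted(Path(src_dir).glob(pattern))
    if not paths:
        raise ValueError(f"no files matching {pattern} in {src_dir}")
    pages = []
    for path in paths:
        with Image.open(path) as img:
            # convert() returns an in-memory copy, so the file can be closed;
            # RGB is used because the PDF writer does not accept RGBA.
            pages.append(img.convert("RGB"))
    # save_all + append_images writes a multi-page document
    pages[0].save(out_pdf, save_all=True, append_images=pages[1:])
    return len(pages)
```

Calling `images_to_pdf("pages", "book.pdf")` would write one PDF page per image, all held in memory at once, so very long books may need to be done in batches.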
Was it Neal Stephenson who has a story where some awful techbro is trying to use an even more destructive scanning technique on an entire university library as a plot device? I’ve read a book like that, but I can’t remember who by or the title.
ETA: shoulda read to the end of the thread. Of course it was Rainbows End.
Z-library is probs your best bet for big, easily accessible book piracy. Errrr, I mean research archives. Yes, that’s it.
Obviously, it’s a play on shotgun sequencing,
but for this to be completely analogous, the library would have to have 12 copies of every book.
Many overlapping reads for each segment of the original DNA are necessary to overcome these difficulties and accurately assemble the sequence. For example, to complete the Human Genome Project, most of the human genome was sequenced at 12X or greater coverage; that is, each base in the final sequence was present on average in 12 different reads. Even so, current methods have failed to isolate or assemble reliable sequence for approximately 1% of the (euchromatic) human genome, as of 2004.[3]
Some of the proprietary scanning software that comes with scanners is certainly rubbish, but there are some decent open source alternatives for achieving the common tasks involved in converting a paper book to an ebook.
The open source Python programming language has a library called Pillow that provides lots of useful functions for image manipulation. With relatively little programming skill it’s possible to kludge together scripts that perform the same image manipulation on all files in a directory, writing the results to a new directory. Splitting a scan of two facing pages into two separate images sounds easy. Moving the white point, so that the off-white page becomes pure white and the file size drops drastically, is straightforward.
https://pillow.readthedocs.io/en/stable/
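As a rough illustration of the kind of script meant here, the sketch below splits each two-page scan down a fixed vertical line and lifts the white point. The function names, the `*.jpg` input pattern, the `gutter_frac=0.5` split position, and the `white=200` threshold are all assumptions to tune for your own scans, and it only works when the gutter sits in the same place on every image.

```python
from pathlib import Path
from PIL import Image

def split_spread(img, gutter_frac=0.5):
    """Split a two-page scan at a fixed fraction of its width."""
    w, h = img.size
    x = int(w * gutter_frac)
    left = img.crop((0, 0, x, h))
    right = img.crop((x, 0, w, h))
    return left, right

def lift_white_point(img, white=200):
    """Convert to greyscale and map grey level `white` and above to 255.

    Darker values are scaled linearly, so black stays black.
    """
    grey = img.convert("L")
    lut = [min(255, round(v * 255 / white)) for v in range(256)]
    return grey.point(lut)

def process_dir(src, dst, gutter_frac=0.5, white=200):
    """Apply both steps to every .jpg in src, writing PNGs into dst."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.glob("*.jpg")):
        with Image.open(path) as img:
            pages = split_spread(img, gutter_frac)
            # "a" is the left page, "b" the right, so names sort in page order
            for side, page in zip(("a", "b"), pages):
                lift_white_point(page, white).save(dst / f"{path.stem}_{side}.png")
```

Running `process_dir("scans", "pages")` would leave the originals untouched and write `name_a.png` and `name_b.png` for each input file.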
gImageReader is open source OCR software for Linux that works pretty well. Like lots of OCR software, it can struggle with italics.
Re copyright: I understand there are people who monitor archives like Z-Library, and when they can prove a work there is in the public domain, they will liberate it to an appropriate Project Gutenberg site. (Copyright rules vary by country, so there are multiple Project Gutenbergs hosted in different countries.)
Edited to add response to Doctor_Faustus below. Topic closed so can’t respond with a normal reply.
Fair point. I’d been staring at the amazing scanners people had built on the DIY site. My understanding was they were built in a way so the pages could be turned without moving the book or camera, and were then pressed flat by a glass sheet, so pages would be consistently placed on each image and there was no curvature to worry about. I then extrapolated to your scenario, incorrectly assuming your split point would be fixed. Sorry - when the split point isn’t fixed, I’ve got no good solution. Best I can offer is that GIMP will let you fix the crop rectangle to a fixed size while still being able to move it about. Set it to the size of one page. Open file. Drag crop rectangle over left page. Crop and export. Ctrl-Z to uncrop. Drag crop rectangle over right page. Crop and export. I’ve done that 50+ times for an old book I was converting to Project Gutenberg. It saved a bit of time, but still painfully slow.
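The same fixed-size, movable crop rectangle can be sketched in Pillow. This does not find the page for you: the `left` and `top` offsets still have to come from somewhere (by eye, or a list you type up), and the function name and parameters below are only illustrative.

```python
from PIL import Image

def crop_fixed(img, left, top, page_size):
    """Crop a page_size (width, height) rectangle with its top-left at (left, top).

    The offsets are clamped so the rectangle always stays inside the image.
    """
    pw, ph = page_size
    w, h = img.size
    if pw > w or ph > h:
        raise ValueError("page_size is larger than the image")
    left = max(0, min(left, w - pw))
    top = max(0, min(top, h - ph))
    return img.crop((left, top, left + pw, top + ph))
```

For one spread you would call it twice with the same `page_size`, once with the offset of the left page and once with the offset of the right page, which mirrors the crop, export, undo, crop, export routine above without the undo.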