Some copiers randomly change the numbers on documents

This old story was disturbing as well: http://youtu.be/iC38D5am7go

You should have read the discussion on Slashdot before posting this article. There are several important things missing from your post.

This happens when you scan to PDF, not when you copy or scan to TIFF.
Xerox is even aware of the problem: it is mentioned several times in the 328-page manual for the copier http://www.cs.unc.edu/cms/help/help-articles/files/xerox-copier-user-guide.pdf . The first mention is on page 107: “Normal/Small produces small files by using advanced compression techniques. Image quality is acceptable but some quality degradation and character substitution errors may occur with some originals”. The scary thing is that this is the default compression setting, not an extra setting that triggers a big flashing warning on the screen when selected.
As one of the previous posters said, it is actually caused by the JBIG2 compression algorithm.
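To make the failure mode concrete, here is a toy Python sketch of the pattern-matching-and-substitution idea behind JBIG2’s lossy symbol mode. This is not actual JBIG2 code — glyphs are tiny bit matrices and the glyph shapes are made up — but it shows how a loose match threshold silently renders one character with another’s bitmap:

```python
# Toy illustration of pattern-matching-and-substitution compression
# (the idea behind JBIG2's lossy symbol mode). Not real JBIG2 code:
# glyphs are tiny 0/1 matrices, and "compression" just stores one
# dictionary copy per matched shape.

def hamming(a, b):
    """Number of differing pixels between two equal-sized glyph bitmaps."""
    return sum(p != q for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))

def compress(glyphs, threshold):
    """Map each glyph to a dictionary entry; loose thresholds cause mismatches."""
    dictionary = []
    indices = []
    for g in glyphs:
        for i, d in enumerate(dictionary):
            if hamming(g, d) <= threshold:
                indices.append(i)       # reuse an existing symbol -- possibly wrong!
                break
        else:
            dictionary.append(g)
            indices.append(len(dictionary) - 1)
    return dictionary, indices

# A clean "8" and a smudged "6" whose upper-right gap is partly filled in
# (invented shapes, purely for illustration):
EIGHT = [(0,1,1,0),
         (1,0,0,1),
         (0,1,1,0),
         (1,0,0,1),
         (0,1,1,0)]
SIX   = [(0,1,1,0),
         (1,0,0,0),
         (1,1,1,0),
         (1,0,0,1),
         (0,1,1,0)]

# They differ in only two pixels, so a loose threshold merges them:
dictionary, indices = compress([EIGHT, SIX], threshold=3)
print(indices)   # -> [0, 0]: the "6" is now rendered with the "8"'s bitmap
```

A stricter threshold (here, 1) keeps the two glyphs apart at the cost of a bigger dictionary — which is exactly the compression-ratio-versus-correctness trade-off the vendors are playing with.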

3 Likes

Gasp! The doctored birth certificate! Here is the proof!

3 Likes

A lot of old engineering documents and prints are scanned to PDF and then put into a controlled documents database. Then the old documents are sent to a warehouse (instead of an on-site library like they used to be). When you want a copy of the controlled document, you print it out from the database or find a controlled copy that was printed from the database. So this type of issue in PDFs could certainly propagate. I’m not an expert in document control, but I certainly hope that the people who are experts are aware of these issues. I’m certainly going to be a little more cautious of older documents that were scanned.

How does a “reputable” company like Xerox allow this sort of problem to go unchecked?

I will from now on assume the Dilbert cartoon is based on Xerox.

Some copiers just want to watch the world burn.

2 Likes

The big problem is that the ‘scan to PDF’ function is used by offices that are trying to go paperless – which means they’re shredding the original paper documents once scanned. This could have a big impact in certain industries. The workers in my late father’s insurance brokerage are probably pulling their hair out right now.

“But it says right here on the copy of the search warrant that it’s 1318 Mockingbird Lane”

1 Like

From the article it appears the substitutions are between small 8’s and small 6’s and are all in one direction. It’s a problem, but it’s as much a problem of the legibility of the original, isn’t it?

Nope, not just a problem with legibility. I just tried it out with our office printer, and it is adding the dimple on the left side of the “8” to the “6”, and the substituted characters are often a different size and misaligned. It looks very much like it’s taking the template of the “8” that it got somewhere else and substituting it for the “6”. This can be a huge problem for companies. For example, in our Purchasing department we often digitize vendor contracts to make them accessible worldwide, and these contracts have tiny pricing tables and schematics. As these become the “official” agreements, there can potentially be disagreements on product price or even design. Not good.

Right, so looking further into the JBIG2 problem: it was considered a revolutionary breakthrough back in 1999. This is not just a Xerox problem – any device that applies lossy JBIG2 compression on the (apparently default) Normal setting runs the risk.

The JBIG.org site lists Compression Success Stories from the legal industry, homeland security, and the media monitoring industry (newspapers).

There’s some talk at Wikipedia about this and they’ve updated to reflect this news, but I wish we had a full list of devices at risk (and suspect we soon will).
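Until such a list exists, one rough way to triage your own archive is to check which PDFs actually contain JBIG2-compressed images. A PDF declares the compression of each image stream with the `/JBIG2Decode` filter name, so searching a file’s raw bytes for that name flags most affected documents. This is only a sketch: a hit means JBIG2 was used (not necessarily the lossy mode), and a miss isn’t proof of safety, since object dictionaries can themselves sit inside compressed object streams.

```python
# Rough triage: flag PDFs whose raw bytes mention the /JBIG2Decode filter,
# which a PDF uses to declare JBIG2-compressed image streams. False
# negatives are possible (dictionaries inside compressed object streams);
# a hit does not distinguish lossless from lossy JBIG2.

from pathlib import Path

def uses_jbig2(pdf_path):
    """True if the file's raw bytes contain the /JBIG2Decode filter name."""
    return b"/JBIG2Decode" in Path(pdf_path).read_bytes()

# Example: scan every PDF in the current directory.
for pdf in sorted(Path(".").glob("*.pdf")):
    if uses_jbig2(pdf):
        print(f"{pdf}: contains JBIG2-compressed images -- verify against the original")
```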

This is potentially a huge, tangly mess, just taking into consideration the inevitable misinformation, misunderstandings, etc. Comments here and there are already going conspiracy on it. Maybe the sooner people start taking it seriously, the better - to find out exactly what the ramifications might be…

2 Likes

The best copiers are Samsung.

My Xerox rep sent me this…

http://realbusinessatxerox.blogs.xerox.com/2013/08/06/always-listening-to-our-customers-clarification-on-scanning-issue

The structure of the JBIG2 standard really isn’t going to help when it comes to misinformation and misunderstandings.

From the JBIG.org site:

“Because JBIG2 is a smart compression standard, it has strict specifications to decode a file, but no precise specifications for how to encode a file. As noted earlier, this allows a sophisticated vendor to employ a variety of techniques to increase the compression ratio.”

So, in addition to the fact that there are both ‘lossless’ and ‘lossy’ modes that are both called “JBIG2 compression” (and, depending on the quality of your vendor’s documentation and UI, it may not be clear which one you are using), the behavior of ‘lossy’ compressors may vary wildly between vendors, so long as the result is a conformant JBIG2 data structure.

The JBIG2 people even note, about the very feature that is causing problems in the Xerox units:

"The Dangers of PM&S: Proceed with Caution

Like all powerful tools, it is essential that PM&S be used correctly. Among the worst mistakes a JBIG2 encoder can make is a font substitution error, commonly known as a mismatch. If an encoder mistakenly includes a character in the wrong font, it will replace that character with the mistaken font in the compressed file. This creates a typo that will be seen in the compressed document. This misspelled word will confuse those who read the document and will cause an OCR engine that processes the compressed file to generate the wrong textual information. The only way to recover the lost information would be to recover it from the original document.

The ability to use PM&S presents many JBIG2 vendors with a dilemma. In order to stay competitive and get the best compression rates, they need to map as many characters as possible to the same font. A single mismatch, though, can potentially make the document worthless. Since the JBIG2 specs have nothing to say on which characters can be safely matched together and which can’t, each JBIG2 vendor must develop their own proprietary algorithms. These algorithms involve sophisticated computer vision techniques. It is therefore not uncommon to find mismatches produced by many JBIG2 implementations, especially from the more recent entrants into the field.

These mismatches can severely degrade image quality. Here is a sample from a typical image file. The top half of the figure below shows what the original above looked like after lossy compression by a typical JBIG2 vendor. By way of contrast, the same document when compressed by a second vendor (CVision PdfCompressor), seen on the bottom half of the figure below, is accurate." (The page links to the sample TIFF and the example corrupted output.)

Unfortunately, they have no real answer to this problem except “Use the lossless version” or “You should verify compressed documents to ensure that they haven’t been corrupted”.
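“Verify compressed documents” in practice means diffing the compressed result against the original scan while you still have it. Here is a minimal sketch of that idea, assuming both pages are available as equal-sized 1-bit bitmaps (lists of 0/1 rows) — nothing like a full toolchain, just the core comparison:

```python
# Minimal verification sketch: compare original scan vs. compressed output
# tile by tile, and flag tiles whose pixel difference exceeds a tolerance.
# A symbol substitution shows up as a concentrated cluster of changed
# pixels where one glyph's bitmap replaced another's.

def mismatch_regions(original, compressed, tile=8, max_diff=0):
    """Yield (row, col) of tile-sized regions whose pixel difference
    exceeds max_diff -- candidates for a silent symbol substitution."""
    rows, cols = len(original), len(original[0])
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            diff = sum(
                original[i][j] != compressed[i][j]
                for i in range(r, min(r + tile, rows))
                for j in range(c, min(c + tile, cols))
            )
            if diff > max_diff:
                yield (r, c)

# Usage: any flagged tile is a spot to inspect by eye against the paper copy.
# flagged = list(mismatch_regions(scan_bitmap, decoded_pdf_bitmap))
```

Of course, this only works if the hard copy (or a lossless scan of it) still exists — which is exactly the problem for the paperless offices mentioned above.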

As long as they adhere to their goal of “Perceptually lossless” lossy compression, this is going to be A Problem. As they put it:

“It is crucial to distinguish between an implementation that sacrifices image quality in order to get a compression savings and one that gets a compression savings through improving image quality. Perceptually lossless JBIG2 mode is where there appears to be significant ROI (return on investment) for the digital imaging industry. This is the mode where digital devices and document management systems can see real benefit from utilizing JBIG2 technology. It truly provides the best of both worlds. The file size is similar to what a naive JBIG2 lossy implementation produces, while the image quality of the original is maintained or even improved.”

The downside, not mentioned, is that unlike conventional lossy compression (like the various familiar image compression formats), which gradually introduces visible artefacts as you turn the screws and throw away more and more data, their lossy compression is deliberately designed to maintain perfect appearance while throwing away data. If your machine-vision-fu is arbitrarily good, this can work (e.g. a text document, even with fonts and formatting info, is tiny and tends to compress well, while a picture of a page of text, complete with fuzz, dust, and chromatic aberration adding spurious color, is enormous and compresses poorly, so an OCR engine can theoretically act as a compression mechanism of extreme power); but if it isn’t, you can hardly think of a better way to silently introduce plausible-looking errors into your output, which is about the most grievous sin one can imagine in a compression system.

Given that the JBIG2 standard places relatively few restrictions on implementation, and that almost all implementations are libraries produced by vendors that end customers never interact with (or even know about), but which are licensed and integrated by companies selling document-handling hardware or software, it is going to be an epic mess to determine exactly which lossy JBIG2 implementations are bad, and which products include those implementations (possibly differing between firmware versions!). And, of course, there may or may not be anything to be done about defective output, short of manual verification against hard copy that might not exist anymore, even if you can trace the origin of a given compressed document.

Epic, simply epic.

2 Likes

Thanks for reading and explaining, fuzzy. Epic, indeed.

I was soft-pedaling because I wasn’t sure (not reading my own links – fail). This needs to be written up and explained correctly for everyone.

~Lauren

Edit: I did read enough to be alarmed, but you’ve nailed the most important points. I have no idea how the correct info can get out into the wild – or better, into an update here on Boing… anyone?

This topic was automatically closed after 5 days. New replies are no longer allowed.