A case study in PDF forensics: The Epstein PDFs

265 points · DuffJohnson · 15 hours ago

pdfa.org

anigbrowl · 11 hours ago
I found this part interesting:
> There are also other documents that appear to simulate a scanned document but completely lack the “real-world noise” expected with physical paper-based workflows. The much crisper images appear almost perfect without random artifacts or background noise, and with the exact same amount of image skew across multiple pages. Thanks to the borders around each page of text, page skew can easily be measured, such as with VOL00007\IMAGES\0001\EFTA00009229.pdf. It is highly likely these PDFs were created by rendering original content (from a digital document) to an image (e.g., via print to image or save to image functionality) and then applying image processing such as skew, downscaling, and color reduction.
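To make the skew point in the quote concrete, here is a minimal sketch, assuming the two endpoints of each page's top border have already been located in pixel coordinates. The coordinates, the helper names, and the 0.01 degree tolerance are hypothetical and not taken from the article.

```python
import math

def skew_degrees(left, right):
    """Angle, in degrees, of the border line through two (x, y) pixel points."""
    (x1, y1), (x2, y2) = left, right
    return math.degrees(math.atan2(y2 - y1, x2 - x1))

def identical_skew(angles, tolerance=0.01):
    """True if every page's skew is within `tolerance` degrees of the first page."""
    return all(abs(a - angles[0]) <= tolerance for a in angles)

# Hypothetical top-border endpoints for three pages of one document.
pages = [
    ((100, 200), (1100, 210)),
    ((100, 200), (1100, 210)),
    ((100, 200), (1100, 210)),
]
angles = [skew_degrees(left, right) for left, right in pages]
print(identical_skew(angles))
```

A physical scanner would be expected to produce slightly different angles from sheet to sheet, so identical values across many pages point toward a skew applied in software.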
ted_bunny · 13 hours ago
Has anyone analysed JE's writing style and looked for matches in archived 4chan posts or content from similar platforms? Same with Ghislaine, there should be enough data to identify them atp right? I don't buy the MaxwellHill claims for various reasons but it doesn't mean there's nothing to find.
yonatan8070 · 12 hours ago
A bit off-topic, but I find it kinda funny that the "Decline" button on the cookie popup on this page is labeled "Continue without consent".
waynenilsen · 14 hours ago
> Information leakage may also be occurring via PDF comments or orphaned objects inside compressed object streams, as I discovered above.
hopefully someone is independently archiving all documents
my understanding is that some are being removed
embedding-shape · 14 hours ago
Re the OCR: I'm currently running allenai/olmocr-2-7b against all the PDFs with text in them and comparing the results with the OCR the DOJ provided. A lot of it doesn't match, and surprisingly olmocr-2-7b is quite good at this. However, after extracting the pages from the PDFs, I'm sitting on ~500K images to OCR, so this is taking quite a while to run through.
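A comparison like this could be scored per page with the standard library. The following is only a sketch, not the commenter's actual pipeline: the page ids, the `(provided_text, new_text)` layout, and the 0.9 threshold are all assumptions.

```python
import difflib

def normalize(text):
    """Collapse whitespace and lowercase so layout differences are not counted."""
    return " ".join(text.split()).lower()

def ocr_similarity(a, b):
    """Similarity ratio in [0, 1] between two OCR outputs for the same page."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def flag_mismatches(pages, threshold=0.9):
    """Return ids of pages whose two OCR texts fall below `threshold`.

    `pages` maps a page id to a (provided_text, new_text) tuple.
    """
    return [
        page_id
        for page_id, (provided, new) in pages.items()
        if ocr_similarity(provided, new) < threshold
    ]

# Hypothetical inputs: one page that agrees, one that does not.
sample = {
    "page-1": ("Hello world", "hello   world"),
    "page-2": ("abc", "xyz"),
}
print(flag_mismatches(sample))
```

With ~500K pages, flagging only the low-scoring ones keeps manual review focused on pages where the two OCR passes genuinely disagree.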
JKCalhoun · 44 minutes ago
Interesting, there are a handful of PDFs in the drop that appear to be an email with a Base64 encoded attachment—inline.
OCR is so bad of course that decoding the Base64 seems futile without a lot of effort.
Example: https://www.justice.gov/epstein/files/DataSet%2011/EFTA02609...
(More mentioned here: https://old.reddit.com/r/Epstein/comments/1qu9az2/theres_unr...)
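A first pass at such a block might look like the sketch below. It is stdlib only, and the `check_base64` helper is hypothetical. It only catches characters outside the Base64 alphabet and padding problems. OCR confusions between valid characters, such as `1`/`l` or `0`/`O`, decode silently to the wrong bytes, which is where most of the effort would go.

```python
import base64
import binascii
import string

B64_ALPHABET = set(string.ascii_letters + string.digits + "+/=")

def check_base64(ocr_text):
    """Try to decode OCR'd Base64.

    Returns (decoded_bytes, []) on success, (None, [(index, char), ...]) when
    characters outside the Base64 alphabet are present, and (None, []) when
    the alphabet is fine but the length or padding is wrong.
    """
    compact = "".join(ocr_text.split())  # drop line breaks and spaces
    bad = [(i, ch) for i, ch in enumerate(compact) if ch not in B64_ALPHABET]
    if bad:
        return None, bad
    try:
        return base64.b64decode(compact, validate=True), []
    except binascii.Error:
        return None, []

print(check_base64("aGVs\nbG8="))
```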
originalvichy · 14 hours ago
Any guesses why some of the newest files seem to have random "=" characters in the text? My first thought was OCR, but it seemed to not be linked to characters like "E" that could be mistakenly interpreted by an OCR tool. My second guess is just making it more difficult to produce reliable text searches, but probably 90% of HN readers could find a way to make a search tool that does not fall apart in case a "=" character is found (although making this work for long search queries would make the search slower).
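The kind of search tool described here can be sketched with a regex that allows an optional stray "=" between any two characters of the query. This is one possible approach under that assumption, and the helper names are hypothetical.

```python
import re

def tolerant_pattern(query):
    """Regex matching `query` with an optional '=' (plus any whitespace
    right after it) allowed between consecutive characters."""
    parts = [re.escape(ch) for ch in query]
    return re.compile(r"(?:=\s*)?".join(parts), re.IGNORECASE)

def find_all(query, text):
    """Return (start, end) spans of every tolerant match of `query` in `text`."""
    return [m.span() for m in tolerant_pattern(query).finditer(text)]

print(find_all("transfer", "wire tran=sfer sent"))
```

The pattern grows linearly with the query length, which matches the caveat about long queries. Deleting every "=" from the text before searching is simpler, but the match offsets then no longer line up with the original file.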
Beijinger · 9 hours ago
What would be more interesting: his bank accounts.
Who paid him?
Who got paid?
_def · 13 hours ago
I can't even download the archive, the transmission always terminates just before it's finished. Spooky.
nkozyra · 14 hours ago
> DoJ explicitly avoids JPEG images in the PDFs probably because they appreciate that JPEGs often contain identifiable information, such as EXIF, IPTC, or XMP metadata
Maybe I'm underestimating the full extent of the issue, but isn't this a very lightweight problem to solve? Is converting the images to lower DPI formats/versions really any easier than just stripping the metadata? Surely the DOJ and similar justice agencies have been aware of and doing this for decades at this point, right?
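For a sense of scale, stripping the metadata segments needs no image library. The sketch below assumes a well-formed JPEG with no fill bytes between segments, and the `strip_jpeg_metadata` helper is hypothetical, not anything the DOJ is known to use. It drops APP1 (EXIF and XMP), APP13 (IPTC), and COM segments and copies everything else untouched.

```python
import struct

# Marker codes to drop: APP1 (EXIF/XMP), APP13 (IPTC), COM (comments).
STRIP = {0xE1, 0xED, 0xFE}

def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return a copy of a JPEG byte string without its metadata segments."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(data[:2])
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:  # SOS: the rest is image data, copy it verbatim
            out += data[pos:]
            return bytes(out)
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if marker not in STRIP:
            out += data[pos:end]
        pos = end
    raise ValueError("no SOS marker found")
```

This only removes metadata containers. Anything identifying in the pixels themselves is untouched, which may be part of why re-rendering the images is preferred.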
bugeats · 13 hours ago
Somebody ought to train an LLM exclusively on this text, just for funsies.
shevy-java · 6 hours ago
So I have been wondering about this ...
Some of the gathered data is shown here, right? Probably not all.
Now ... that's static information though. That's not really an analysis, most definitely not an independent (open ended) analysis. And it will only show a very incomplete part of the full picture.
This is why I think the "release the files" movement, as good as it is, seems incomplete. I'd rather know a lot more about how they operated their networks and how they got away with what they did involving underage women. How about secret services of other countries? Should that not also be highly important? So why is there not really a larger investigation as well as independent analysis? Those .pdf files alone cannot tell the whole picture. That can just be the tip of the iceberg, and it evidently involves other countries too, with Prince Andrew being the most famous here (i.e., the UK, but we already saw that other countries also have similar issues where people suddenly had to step away from politics when it was found out they visited the party locations of Mr. Epstein).
corygarms · 14 hours ago
These folks must really have their hands full with the 3M+ pages that were recently released. Hoping for an update once they expand this work to those new files.
tibbon · 14 hours ago
That's a lot of PeDoFiles!
(But seriously, great work here!)
mmooss · 12 hours ago
What is the legal basis for releasing someone's private files and communications? If they can do it to Epstein, they can do it to you, to the Washington Post journalist, to former President Clinton, etc.
Is the scope at least limited somehow? Generally I favor transparency, but of course probably the most important parts are withheld.
meidan_y · 15 hours ago
(2025) just follow hn guideline, impressive voter ring though
NoToP · 12 hours ago
This is so incredibly useful to me right now, for incidental reasons, that I am commenting to make sure I can get back to it.


news.ycombinator.com/item?id=46886440
