
I'm quoted in this article. Happy to discuss what we're working on at the Library Innovation Lab if anyone has questions.

There are lots of people making copies of things right now, which is great -- Lots Of Copies Keeps Stuff Safe. It's your data, why not have a copy?

One thing I think we can contribute here as an institution is timestamping and provenance. Our copy of data.gov is made with https://github.com/harvard-lil/bag-nabit , which extends the BagIt format to sign archives with email/domain/document certificates. That way (once we have a public endpoint) you can make your own copy with rclone, pass it around, but still verify it hasn't been modified since we made it.

Some open questions we'd love help on --

* One is that it's hard to tell what's disappearing and what's just moving. If you do a raw comparison of snapshots, you see things like 2011-glass-buttes-exploration-and-drilling-535cf being replaced by 2011-glass-buttes-exploration-and-drilling-236cf, but it's still exactly the same data; it's a rename rather than a delete and add. We need some data munging to work out what's actually changing (a rough sketch of one possible approach is below).

* Another is how to find the most valuable things to preserve that aren't directly linked from the catalog. If a data.gov entry links to a csv, we have it. If it links to an html landing page, we have the landing page. It would be great to do some analysis to figure out the most valuable stuff behind the landing pages.
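
For the first question, here's a rough sketch of one way to separate renames from true deletions, assuming the snapshots are zipped JSONL of CKAN-style package records; the "name"/"title"/"resources" field names and the file names are assumptions for illustration, not the lab's actual schema:

    import json, zipfile

    def load_snapshot(path):
        # Load a zipped JSONL catalog dump into {slug: record}.
        records = {}
        with zipfile.ZipFile(path) as z, z.open(z.namelist()[0]) as f:
            for line in f:
                if line.strip():
                    rec = json.loads(line)
                    records[rec["name"]] = rec
        return records

    def content_key(rec):
        # A slug-independent identity: title plus the sorted resource URLs.
        urls = sorted(r.get("url", "") for r in rec.get("resources", []))
        return (rec.get("title", ""), tuple(urls))

    old = load_snapshot("snapshot_old.jsonl.zip")  # placeholder file names
    new = load_snapshot("snapshot_new.jsonl.zip")

    gone = {slug: old[slug] for slug in old.keys() - new.keys()}
    new_keys = {content_key(rec) for rec in new.values()}

    renamed = {slug for slug, rec in gone.items() if content_key(rec) in new_keys}
    deleted = gone.keys() - renamed
    print(f"{len(renamed)} likely renames, {len(deleted)} likely true deletions")

Entries whose slug disappeared but whose title and resource URLs reappear under a new slug (like the glass-buttes example) get counted as renames rather than deletions.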



A common metric for how much actual content has changed is the Jaccard index. Even for numbers of datasets too large to fit in memory, it can be approximated with various forms of the MinHash algorithm. Some write-up here: https://blog.nelhage.com/post/fuzzy-dedup/

https://en.wikipedia.org/wiki/Jaccard_index
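
As a toy illustration of the MinHash idea (a minimal sketch with a salted-hash "permutation" per signature slot, not the approach from the linked post):

    import hashlib

    def shingles(text, k=5):
        # Overlapping k-word shingles as the set to compare.
        tokens = text.split()
        return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

    def minhash(items, num_perm=128):
        # Signature slot i = min over all items of a 64-bit hash salted with i.
        return [
            min(
                int.from_bytes(
                    hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                    "big")
                for item in items)
            for seed in range(num_perm)
        ]

    def estimated_jaccard(a, b):
        sig_a, sig_b = minhash(a), minhash(b)
        # Fraction of matching slots approximates |intersection| / |union|.
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

    # estimated_jaccard(shingles(old_text), shingles(new_text)) -> roughly 0.0 to 1.0

The point is that you only keep the fixed-size signatures around, so you can compare huge numbers of datasets without holding their contents in memory.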


> sign archives with email/domain/document certificates

I do a bit of web archival for fun, and have been thinking about something.

Currently I save both response body and response headers and request headers for the data I save from the net.

But I was thinking that maybe, instead of just saving that, I could go a level deeper and preserve the actual TCP packets and the TLS key exchange material.

And then I might be able to get a lot of data provenance “for free”: if in some decades we look back at the saved TCP packets and TLS material, we would see that these packets were protected by a certificate chain that matches what that website was serving at the time. Assuming of course that they haven’t accidentally leaked their private keys in the meantime, that the CA hasn’t gone rogue since, etc.

To me it would make sense to build out web archival infra that preserves the CA chain and enough material to show later that it was valid. And if many people across the world save the right parts, we don’t have to trust each other in order to verify that data the other person saved really was sent by the website our archives say it was from.

For example, maybe I only archived a single page from some domain, and you saved a whole bunch of other pages from that domain around the same time, so the same certificate chain was used in the responses to both of us. Then I could know that the data you say you archived from them really was served by their server, because I have the certificate chain I saved to verify it against.


In terms of tooling, there's scoop [0], which does a lot of the capture part of what you're thinking about. The files it creates include request headers and responses, TLS certificates, PDFs and screenshots, and it has support for signing the whole thing as proof of provenance.

Overall though I think archive.org is probably sufficient proof that a specific page had certain content on a certain day for most purposes today.

0. https://github.com/harvard-lil/scoop


The idea is good. As far as I understand TLS, however, the cert / asymmetric key is only used to prove the identity/authenticity of the cert, and thus of the host, for this session.

But the main content is not signed / checksummed with it, but rather with a symmetric session key, so one could probably manipulate it in the packet dump anyway.

I read about a Google project named SXG (Signed HTTP Exchanges) that might do related stuff, albeit likely requiring the assistance of the publisher.


"TLS-N", "TLS Sign", and maybe a couple others were supposed to add non-repudiation.

But they didn't really go anywhere:

https://security.stackexchange.com/questions/52135/tls-with-...

https://security.stackexchange.com/questions/103645/does-ssl...

There are some special cases, like (I think) certain signed e-mail headers, that do provide non-repudiation.

For that, `tcpdump` with `SSLKEYLOGFILE` will probably get you started on capturing what you need.
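
If you're doing the client side in Python, the ssl module can write the same key log format directly (Python 3.8+ built against OpenSSL 1.1.1+), paired with a tcpdump capture running alongside; a minimal sketch:

    import ssl, urllib.request

    ctx = ssl.create_default_context()
    ctx.keylog_filename = "tls-keys.log"  # same format as SSLKEYLOGFILE

    # Run tcpdump alongside, e.g.: tcpdump -w capture.pcap host example.org
    # The pcap plus tls-keys.log lets Wireshark decrypt the session later.
    with urllib.request.urlopen("https://example.org/", context=ctx) as resp:
        body = resp.read()

(But note the non-repudiation caveat discussed elsewhere in the thread: the decrypted transcript shows what you received, it doesn't prove to a third party that the server sent it.)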


To extend this to archival integrity without cooperation from the server/host, you'd need the client to sign the received bytes.

But then you need the client to be trusted, which clashes with distributing.

Hypothetically, what about trusted orgs standing up an endpoint that you could feed a URL to, then receive back an attestation from them as to the content, then include that in your own archive?

Compute and network traffic are pretty cheap, no?

So if it's just grabbing the same content you are, signing it, then throwing away all the data and returning you the signed hash, that seems pretty scalable?

Then anyone could append that to their archive as a certificate of authenticity.
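
A hypothetical sketch of what such an attestation record could look like (the record format and the Ed25519 key here are inventions for illustration, not an existing service):

    import hashlib, json, time, urllib.request
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    signing_key = Ed25519PrivateKey.generate()  # in reality: the org's published long-term key

    def attest(url: str) -> dict:
        body = urllib.request.urlopen(url).read()
        record = {
            "url": url,
            "sha256": hashlib.sha256(body).hexdigest(),
            "fetched_at": int(time.time()),
        }
        payload = json.dumps(record, sort_keys=True).encode()
        record["signature"] = signing_key.sign(payload).hex()
        return record  # the fetched body itself is thrown away

An archivist would store the returned record next to their own copy; anyone holding the org's public key can re-hash the copy and check the signature. Of course it only attests to what the org's fetch saw at that moment, not that your copy came over the same wire.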


Reminds me of the timestamp protocol and timestamp authorities.

Not quite the same problem, but similar enough to have a similar solution. https://www.ietf.org/rfc/rfc3161.txt


Unfortunately, the standard TLS protocol does not provide a non-repudiation mechanism.

It works by using public key cryptography and key agreement to get both parties to agree on a symmetric key, and then uses the symmetric key to encrypt the actual session data.

Any party who knows the symmetric key can forge arbitrary data, and so a transcript of a TLS session, coupled with the symmetric key, is not proof of provenance.
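
A toy illustration of that point, using an AEAD cipher as a stand-in for the TLS record layer (not actual TLS framing): whoever holds the symmetric key can mint a "record" for any plaintext they like.

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=128)  # stand-in for the session key
    aead = AESGCM(key)

    genuine_nonce, forged_nonce = os.urandom(12), os.urandom(12)
    genuine = aead.encrypt(genuine_nonce, b"what the server really sent", None)
    forged = aead.encrypt(forged_nonce, b"something the server never sent", None)

    # Both decrypt and authenticate cleanly under the same key.
    assert aead.decrypt(genuine_nonce, genuine, None)
    assert aead.decrypt(forged_nonce, forged, None)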

There are interactive protocols that use multi-party computation (see for example https://tlsnotary.org/) where there are two parties on the client side, plus an unmodified server. tlsnotary only works for TLS 1.2. One party controls and can see the content, but neither party has direct access to the symmetric key. At the end, the second party can, by virtue of having interactively been part of the protocol, provably know a hash of the transcript. If that second party is a trusted third party, they could sign a certificate.

However, there is not a non-interactive version of the same protocol - you either need to have been in the loop when the data was archived, or trust someone who was.

The trusted third party can be a program running in a trusted execution environment (but note pretty much all current TEEs have known fault-injection flaws), or in a cloud provider that offers vTPM attestation and a certificate for the machine state. For example, Google signs a certificate saying an endorsement key is authentically from Google; the vTPM signs a certificate saying a particular key is restricted to the vTPM and only available when the compute instance is running particular known binary code; and that key is used to sign a certificate attesting to a TLS transcript.

I'm working on a simpler solution that doesn't use multiparty computation, and provides cloud attestation - https://lemmy.amxl.com/c/project_uniquonym https://github.com/uniquonym/tls-attestproxy - but it's not usable yet.

Another solution is if the server will cooperate with a TLS extension. TLS-N (https://eprint.iacr.org/2017/578.pdf) does exactly this, which makes provenance trivial.


As important as cryptography is, I also wonder how much of it is trying to find technical solutions for social problems.

People are still going to be suspicious of each other, and service providers are still going to leak their private keys, and whatnot.


You may be interested in the Reclaim Protocol and perhaps zkTLS. They have something very similar going, and the sources are free.

https://github.com/reclaimprotocol

https://drive.google.com/file/d/1wmfdtIGPaN9uJBI1DHqN903tP9c...

https://www.reclaimprotocol.org/

https://docs.lighthouse.storage/lighthouse-1/zktls


It’s an interesting idea for sure. Some drawbacks I can think of:

- bigger resource usage. You will need to maintain a dump of the TLS session AND an easily extractable version

- difficulty of verification. OpenSSL / BoringSSL / etc. will all evolve and, say, completely remove support for some TLS versions, ciphers, and TLS extensions… This might make many dumps unreadable in the future, or require the exact same version of a given piece of software to read them. Perhaps adding the decoding binary to the dump would help, but then you’d run into Linux backwards-compatibility issues.

- compression issues: new compression algorithms will be discovered and could reduce storage usage. You’ll have a hard time taking advantage of that, since TLS streams will look random to the compression software.

I don’t know. I feel like it’s a bit overkill — what are the incentives for tampering with this kind of data?

Maybe a simpler way of going about it would be to build a separate system that does the « certification » after the data is dumped; combined with multiple orgs actually dumping the data (reproducibility), this should be enough to prove that a dataset is really what it claims to be.


Just commenting to double down on the need for cryptographic timestamping - especially in the current era of generative AI.



How does that work exactly? Does it all still hinge on trusting a known Time Stamp Authority, or is there some way of timestamping in a trustless manner?


I'm so sad roughtime never got popular. It can be used to piggyback a "proof of known hash at time" mechanism, without blockchain waste.

https://www.imperialviolet.org/2016/09/19/roughtime.html

https://int08h.com/post/to-catch-a-lying-timeserver/

https://blog.cloudflare.com/roughtime/

https://news.ycombinator.com/item?id=12599705


You can publish the hash in some durable medium, like the classified section of a newspaper.

This proves you generated it before this time.

You can also include in the hash the close of the stock market and all the sports scores from the previous day. That proves you generated it after that time.
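
As a sketch of that bracketing idea with plain hashes (the entropy string is a placeholder, not a real feed):

    import hashlib

    # Anything unpredictable from yesterday: closing prices, final scores, etc.
    yesterdays_entropy = "<yesterday's market close and sports scores>"

    with open("my-archive.tar", "rb") as f:
        archive_hash = hashlib.sha256(f.read()).hexdigest()

    commitment = hashlib.sha256(
        f"{yesterdays_entropy}|{archive_hash}".encode()).hexdigest()
    print(commitment)  # this is the value you'd publish in the classifieds

The commitment can't have been produced before yesterday's results existed, and publishing it fixes the latest date it could have been produced.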


My mind immediately went to adversarial fixing of all sports games and the stock market in order to create a collision.

Sports scores are an interesting source of entropy.


If you are looking to prove that something happened after a certain timestamp, you can use a randomness beacon [0]. Every <interval>, the beacon outputs a long random number. Include the timestamped random number in your artifact.

You are relying upon the authority of the beacon to be random, but good practice is to utilize multiple independent beacons.

[0] https://csrc.nist.gov/projects/interoperable-randomness-beac...
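
A sketch of folding a beacon pulse into an artifact so it provably postdates the pulse; the endpoint and JSON field names are assumptions about the NIST beacon v2 API, so check the docs at [0] before relying on them:

    import hashlib, json, urllib.request

    BEACON_URL = "https://beacon.nist.gov/beacon/2.0/pulse/last"  # assumed endpoint

    with urllib.request.urlopen(BEACON_URL) as resp:
        pulse = json.load(resp)["pulse"]       # assumed response layout

    beacon_value = pulse["outputValue"]        # the pulse's random hex string
    with open("my-archive.tar", "rb") as f:
        digest = hashlib.sha256(beacon_value.encode() + f.read()).hexdigest()

    print(pulse["timeStamp"], digest)

Combining pulses from several independent beacons (and timestamping the result, as discussed above) brackets the artifact from both sides.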


This is the one thing blockchains are truly good for.


But there have to be economic incentives to maintain the data, and only Bitcoin can even begin to make that claim, and even it is only 16 years old.

Still, OpenTimestamps does exactly this, and has been running for over 8 years now.


Yeah, it definitely could be, though you may similarly find yourself in a spot of trusting a limited number of nodes to guarantee the chain was never tampered with.


For something like this there’s ways to minimize how much you need to trust nodes such as regularly publishing hashes to 3rd parties like HN.

Not so useful if something was edited a few minutes after posting, but it makes it more difficult for a new administration to suddenly edit a bunch of old data.


> there’s ways to minimize how much you need to trust nodes such as regularly publishing hashes to 3rd parties like HN.

But you could do the same thing with any hashes, right? There is no need for a blockchain in the middle.


What happens as websites disappear? With a blockchain, in 2090 you can point to a website post from 2060 as support that your hashes on data posted in 2030 are still valid. That’s useful when preventing people from rewriting history is the goal.

There’s also a size advantage. You can keep a diff of the archive for each hash being posted, instead of the full index every time you post a hash.


You make use of several independent authorities for each timestamped document.

The chance is exceedingly low that the PKI infrastructure of all the authorities becomes compromised.


I'd love to learn more about what is in scope of the Library Innovation Lab projects. Is it targeting data.gov specifically or all government agency websites?

Given the rapid takedowns of websites (CDC, USAID), do you have a framework for which website pages to prioritize, or do you aim for "comprehensive" coverage of pages (in scope of the project)?

As you allude to, I've been having a hard time learning about what sort of duplicate work might be happening, given that there isn't a great "archived coverage" source of truth for government websites (between projects such as the End of Term Archive, the Internet Archive, research labs, and independent archivists).

Your open questions are interesting. Content hashes for each page/resource would be a way to do quick comparisons, but I assume you might want to set some threshold to determine how much it changed vs whether it changed?

Is the second question about figuring out how to prioritize valuable stuff behind two depth traversals? (ex data.gov links to another website and that website has a csv download)


As a library, the very high level prioritization framework is "what would patrons find useful." That's how we started with data.gov and federal GitHub repos as broad but principled collections; there's likely to be something in there that's useful and gets lost. Going forward I think we'll be looking for patron stories along the lines of "if you could get this couple of TB of stuff it would cover the core of what my research field depends on."

In practice it's some mix of, there aren't already lots of copies, it's valuable to people, and it's achievable to preserve.

> Is the second question about figuring out how to prioritize valuable stuff behind two depth traversals?

Right -- how do you look at the 300,000 entries and figure out what's not at depth one, is archivable, and is worth preserving? If we started with everything it would be petabytes of raw datasets that probably shouldn't be at the top of the list.


Thank you for this effort.


Hi! Is there any one place that would be easiest for folks to grab these snapshots from? Would love to try my hand at finding documents that moved/documents that were removed.


Hmm, I can put them here for now: https://source.coop/harvard-lil/data-gov-metadata

Unfortunately it's a bit messy because we weren't initially thinking about tracking deletions. data_20241119.jsonl.zip (301k rows) and data_20250130.jsonl.zip (305k rows) are simple captures of the API on those dates. data_db_dump_20250130.jsonl.zip (311k rows) is a SQLite dump of all the entries we saw at some point between those dates. My hunch is there's something like 4,000 false positives and 2,000 deletions between the 311k and 305k sets, but that could be way off.


Very cool! I'll take a look :)


How can people help? It sounds like a global index of sources is needed, and the work to validate those sources, over time, parceled out. Without something coordinated I feel like it is futile to even jump in.


I spent a bunch of time on this project feeling like it was futile to jump in, and then just jumped in; messing with data is fun even if it turns out someone else has your data. But the government is huge; if you find an interesting report and then poke around for the .gov data catalog or directory index structure or whatever contains it, you're likely to find a data gathering approach no one else is working on yet.

There are coordinated efforts starting to come together in a bunch of places -- some on r/datahoarders, some around specific topics like climate data (EDGI) or CDC data, and there are datasets being posted on archive.org. I think one way is to find a topic or kind of data that seems important and search around for who's already doing it. Eventually maybe there'll be one answer to rule them all, but maybe not; it's just so big.


Very tangentially related, but it always makes me smile to see rclone mentioned in the wild - its creator ncw was the CEO of the previous company I worked at.


Trump did this last time too. Is there a difference in the level of preparedness in archiving data compared to last time? If so, in what way is it different? Is there institutional or independent preparedness?


(Note my lab isn't partisan and this isn't a partisan effort; public data always needs saving. But there's definitely a reason people are paying attention right now.)

I think in some ways the community was less prepared this time, because there was a lot of investment in 2016-2017 and then many of the archives created at that point didn't end up being used, partly because the changes at the federal level turned out to be smaller and slower in 2017 than they're looking like this time. So some people didn't choose to invest that way this time around.

[Edit: this means I think it's really important that data archives are useful. Sorting through data and putting a good interface on it should help people out today as well as being good prep for the future.]

In other ways there's much more preparation; the EOT Archive now has a regular practice of crawling .gov websites before and after each change of administration, which is a really great way of giving citizens a sense of how their government evolves. It will just tend to miss data that you can't click through to in a generic crawl.



> "Project Russia" is spiritual warfare

> https://washingtonspectator.org/project-russia-reveals-putin...

What in the Cold War conspiracy theory was that...

>> The Kremlin’s design necessarily depends on the adoption of a single world belief system or religion. Expect a syncretic, gnostic blend, rooted in hierarchy — the Russian Orthodox Church at the core, and other religious factions accorded favor based on demonstrated fealty.

Good luck with that. The most popular religions today are forked versions of one guy's story that they couldn't agree on, and they have been involved in acts of genocide towards one another because of it--for centuries.



