Neither openlaws nor public.resource actually lets you just pull the laws (or the scrapers) in a common format, as far as I can tell?

I was thinking something more along the lines of a git repo per state.


Who will maintain the git repo per state [1] [2]? There is value in a pipeline that continually ingests this data from various sources and pushes it into the Internet Archive, but if you want to treat the result as authoritative, it needs a human minding it, because of entropy and decay. Even the Python Software Foundation, which maintains infrastructure everyone depends on, runs on a budget of ~$5M/year. Hence my openlaws.us example.
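
For concreteness, the "push it into the Internet Archive" step is the easy part. A minimal sketch in Python using the internetarchive package (the item-identifier scheme and metadata here are made up for illustration):

    # pip install internetarchive; assumes IA credentials set up via `ia configure`
    from internetarchive import upload

    def push_snapshot(state: str, snapshot_path: str, snapshot_date: str) -> None:
        """Upload one state's scraped snapshot as an Internet Archive item."""
        identifier = f"us-laws-{state}-{snapshot_date}"  # hypothetical naming scheme
        upload(
            identifier,
            files=[snapshot_path],
            metadata={
                "title": f"{state.upper()} statutes snapshot, {snapshot_date}",
                "mediatype": "texts",
            },
        )

The entropy lives upstream of that call, in the scrapers that have to keep producing snapshot_path year after year.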

If it were as easy as writing a scraper and dumping it all in a bucket or repo, it'd already be done. It's the usual grind of thankless hard work sustained over time.

[1] https://xkcd.com/2347/

[2] https://en.wikipedia.org/wiki/Free-rider_problem


I know a thing or two about that: 2,400 commits to the scrapers powering openstates over the past 9 years.

Even with openstates, we have an API but don't "just" dump the bills to git for legacy nerd reasons.
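
The bill data is all reachable through the API, though. A fetch loop looks roughly like this (Python; the endpoint, header, and response fields are from memory of the v3 docs, so verify them before relying on this):

    import requests

    API_KEY = "..."  # free key from openstates.org

    def fetch_bills(jurisdiction: str, session: str) -> list[dict]:
        """Page through Open States v3 bill results for one session."""
        bills, page = [], 1
        while True:
            resp = requests.get(
                "https://v3.openstates.org/bills",
                headers={"X-API-KEY": API_KEY},
                params={"jurisdiction": jurisdiction, "session": session, "page": page},
                timeout=30,
            )
            resp.raise_for_status()
            data = resp.json()
            bills.extend(data["results"])
            if page >= data["pagination"]["max_page"]:
                break
            page += 1
        return bills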

The nice thing about laws is that the host websites (or PDFs) don't change templates _that_ often, so generally you can rescrape quarterly (or in some states, annually) without a ton of maintenance. With administrative codes you need to scrape more often, but the websites are still pretty stable.
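
In practice a quarterly cadence can be a cron job plus a cheap change check. A sketch, assuming hypothetical state tracking and a naive whole-page hash (a real scraper would fingerprint the template structure instead, since live pages often have dynamic bits):

    import hashlib
    from datetime import date, timedelta

    RESCRAPE_EVERY = timedelta(days=90)  # quarterly; ~365 for the slow states

    def needs_rescrape(last_run: date, last_hash: str, sample_html: str) -> bool:
        """Rescrape when the interval has elapsed or the template looks changed."""
        current = hashlib.sha256(sample_html.encode()).hexdigest()
        return date.today() - last_run >= RESCRAPE_EVERY or current != last_hash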

The downside is that codes in particular are often big: a single scrape might need to make 20,000 or more requests, so you have to be very careful about rate limiting and proxies. Which goes back to my original point, that it sucks that accessing this stuff is such a mess.
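
For scale: at a polite 2 seconds per request, a 20,000-request scrape is about 11 hours of wall time, which is why you end up writing something like this (sketch; the proxy URLs are placeholders):

    import itertools
    import time

    import requests

    PROXIES = ["http://proxy-a:8080", "http://proxy-b:8080"]  # placeholders
    _proxy_cycle = itertools.cycle(PROXIES)

    def polite_get(url: str, delay: float = 2.0) -> requests.Response:
        """Fetch with a fixed delay and a rotating proxy; back off once on 429."""
        time.sleep(delay)
        proxy = next(_proxy_cycle)
        proxies = {"http": proxy, "https": proxy}
        resp = requests.get(url, proxies=proxies, timeout=30)
        if resp.status_code == 429:  # the server asked us to slow down
            time.sleep(float(resp.headers.get("Retry-After", 60)))
            resp = requests.get(url, proxies=proxies, timeout=30)
        resp.raise_for_status()
        return resp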


The assessor's office in my county provides data older than two weeks (IIRC) as a few SQL export dumps, because that's how things were done back in the day.

Current information is gated behind a web 2.0 view of their live data with severe limits. It wasn't designed to be scraped and is in fact hostile to the attempt. I'd imagine their hosting costs are rising and will keep rising.

I should reach out to them and see what this looks like from their angle. The local commercial real estate community is pretty tech-savvy and I'm wondering if we could all be a bit more proactive around data access.

I'd love to hear your thoughts on county vs. state vs. national data! I'd be especially interested in any bandwidth usage or processing-requirement numbers you might have recorded.


Fair, I stand corrected. Thanks for your work. All openness efforts are welcome.



