
The data lake is also a real, live GDPR PII time bomb if you worked out how to get the data in but not how to take it out.


But if I haven't spent the effort to extract it, do I really own it? Let me argue that I don't have it because all my implemented queries turn up none of your data. You wouldn't tax me on gold that hasn't yet been extracted, would you? (End of joke.)


  But if I haven't spent the effort to extract it, do I really own it? 
If you collected it, you are responsible for it.


What I think you'd typically do is put different data under different keys/paths, so that red is personally identifiable data, yellow contains pointers to such data, and green is just regular data. You could have a structure like:

s3://my-data-lake/{red|yellow|green}/{raw|intermediate}/year={year}/month={month}/day={day}/source={system}/dataset={table}

Then you just don't keep red data for longer than 30 days.
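
For what it's worth, here is a rough Python sketch of that layout plus the 30-day check. It is only an illustration: the function names (partition_prefix, is_expired), the zero-padded month/day, and measuring age from the partition date rather than the ingestion time are all my assumptions, not something the scheme above prescribes.

```python
import re
from datetime import date

RED_RETENTION_DAYS = 30  # retention window for red (PII) data, as suggested above

TIERS = {"red", "yellow", "green"}
STAGES = {"raw", "intermediate"}

# Matches the leading part of the layout; month/day may or may not be zero-padded.
_PREFIX_RE = re.compile(
    r"^s3://[^/]+/(red|yellow|green)/(raw|intermediate)"
    r"/year=(\d{4})/month=(\d{1,2})/day=(\d{1,2})/"
)


def partition_prefix(tier, stage, day, source, dataset, bucket="my-data-lake"):
    """Build a prefix following the colour-coded layout (hypothetical helper)."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    # Zero-padding is an assumption; it keeps prefixes sorted lexicographically.
    return (
        f"s3://{bucket}/{tier}/{stage}"
        f"/year={day.year}/month={day.month:02d}/day={day.day:02d}"
        f"/source={source}/dataset={dataset}"
    )


def is_expired(path, today):
    """Return True if path is a red partition older than the retention window."""
    match = _PREFIX_RE.match(path)
    if match is None:
        raise ValueError(f"path does not follow the layout: {path!r}")
    tier, _stage, year, month, day = match.groups()
    if tier != "red":
        return False  # only red data carries the 30-day limit in this sketch
    partition_date = date(int(year), int(month), int(day))
    return (today - partition_date).days > RED_RETENTION_DAYS


prefix = partition_prefix("red", "raw", date(2024, 1, 5), "crm", "customers")
print(prefix)
print(is_expired(prefix, date(2024, 3, 1)))
```

In practice you probably wouldn't run this yourself: an S3 lifecycle rule with an expiration action filtered on the red/ prefix can do the deletion for you, which is one reason to put the colour at the top of the key.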



