
Most of the discussion I found on this topic is about how to extract information. Is there any technique for extracting interactive elements? I reckon listing all the inputs/controls would not be hard, but finding the corresponding labels/articles might be tricky.
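To make the label-matching problem concrete, here is a minimal sketch using only Python's standard `html.parser`. The names `ControlLabelParser` and `extract_controls` are made up for illustration, and the lookup order (`aria-label`, then a wrapping `<label>`, then `<label for=...>`, then `placeholder`) is an assumption about what is usually good enough, not an established method. It ignores buttons, `aria-labelledby`, and labels that are only visually adjacent to a control, which is probably where the tricky cases are.

```python
from html.parser import HTMLParser

CONTROL_TAGS = {"input", "select", "textarea"}


class ControlLabelParser(HTMLParser):
    """Collect form controls and pair each with its most likely label text."""

    def __init__(self):
        super().__init__()
        self.controls = []       # one dict per control, in document order
        self.labels_by_for = {}  # label text keyed by the label's for="..." value
        self._label = None       # state of the <label> currently open, if any
        self._skip_depth = 0     # > 0 while inside <select>/<textarea> content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._label = {"for": attrs.get("for"), "text": [], "wrapped": []}
        elif tag in CONTROL_TAGS:
            if attrs.get("type") == "hidden":
                return  # hidden inputs are not interactive
            control = {
                "tag": tag,
                "id": attrs.get("id"),
                "name": attrs.get("name"),
                "label": attrs.get("aria-label"),
                "fallback": attrs.get("placeholder"),
            }
            self.controls.append(control)
            if self._label is not None:
                self._label["wrapped"].append(control)
            if tag != "input":
                self._skip_depth += 1  # do not treat option text as label text

    def handle_data(self, data):
        if self._label is not None and self._skip_depth == 0:
            self._label["text"].append(data)

    def handle_endtag(self, tag):
        if tag in ("select", "textarea") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "label" and self._label is not None:
            # Collapse runs of whitespace in the collected label text.
            text = " ".join("".join(self._label["text"]).split())
            if self._label["for"]:
                self.labels_by_for[self._label["for"]] = text
            for control in self._label["wrapped"]:
                control["label"] = control["label"] or text
            self._label = None


def extract_controls(html):
    """Return a list of {tag, name, label} dicts for visible form controls."""
    parser = ControlLabelParser()
    parser.feed(html)
    parser.close()
    results = []
    for c in parser.controls:
        label = c["label"] or parser.labels_by_for.get(c["id"]) or c["fallback"]
        results.append({"tag": c["tag"], "name": c["name"], "label": label})
    return results
```

Calling `extract_controls(page_html)` on a page's markup gives a flat list that could be handed to a model alongside the extracted text. Controls with no match come back with `label` set to `None`, which at least flags the ones that need a smarter heuristic.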

Another thing I wonder about, regarding text extraction: would it be a crazy idea to just snapshot the page and ask it to OCR & generate a bare-minimum HTML table layout? That way both the content and the spatial relationships between elements are preserved (not sure how useful that is, but I'd like to keep it anyway).
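A rough sketch of the "spatial layout as a table" half of that idea, assuming the OCR step has already produced word boxes as `(text, x, y)` tuples with top-left pixel coordinates. That input format, the function name `boxes_to_table`, and the tolerance values are all assumptions for illustration; no particular OCR tool is implied. It buckets boxes into row and column bands by coordinate gaps and emits one cell per band intersection.

```python
from html import escape


def boxes_to_table(boxes, row_tol=10, col_tol=40):
    """Turn (text, x, y) word boxes into a bare HTML table string."""
    if not boxes:
        return "<table></table>"

    def cluster(values, tol):
        # Walk the sorted coordinates; start a new band when the gap exceeds tol.
        bands = []
        for v in sorted(set(values)):
            if not bands or v - bands[-1][-1] > tol:
                bands.append([v])
            else:
                bands[-1].append(v)
        # Map every coordinate to the index of the band it landed in.
        return {v: i for i, band in enumerate(bands) for v in band}

    row_of = cluster([y for _, _, y in boxes], row_tol)
    col_of = cluster([x for _, x, _ in boxes], col_tol)
    n_rows = max(row_of.values()) + 1
    n_cols = max(col_of.values()) + 1

    # Each grid cell holds the words that fell into that row/column band.
    grid = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    for text, x, y in sorted(boxes, key=lambda b: (row_of[b[2]], b[1])):
        grid[row_of[y]][col_of[x]].append(text)

    rows = []
    for row in grid:
        cells = "".join(f"<td>{escape(' '.join(cell))}</td>" for cell in row)
        rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(rows) + "</table>"
```

The gap-based banding is deliberately crude: a long run of closely spaced boxes chains into one band, and multi-word cells only stay together if their x offsets fall within `col_tol`. It does show that the table is cheap to build once boxes exist, so the hard part is likely the OCR quality rather than the layout step.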


