Hacker News: asciimoo's comments

Hi, author here. The index you build can indeed contain sensitive data, but you can specify URL patterns to skip indexing of the matching pages.

Personally, I'd love this if it were opt-in. That way, I could gradually reduce my repeat-search dependence as I recognize my actual habits, rather than giving a browser extension carte blanche access to my entire search history. Maybe that's already possible, but I didn't see any documentation about the config file.

I'm using & developing Omnom (read-only demo: https://omnom.zone/ ). It is self-hosted, free software, fediverse-compatible, and creates 1:1 snapshots of the saved websites: https://github.com/asciimoo/omnom


This looks great! Does it capture the website from what is currently rendered in the browser, or does it fetch it through a separate GET request? In other words, if I am on a site that is only locally available or requires a login, will it still capture the website?


Great project name!


It's not only open source, it is free software. Take a look at https://github.com/asciimoo/omnom - suggestions/contributions are appreciated =)


That looks like a pretty heavy-weight solution, with a lot of complexity, and I don't mean that as a criticism at all. I'm not a 'go' developer myself. I've always wanted a pure JS solution (as a browser extension, maximum of 200 lines of code) that can capture the content of a web page (doing a virtual scroll to the bottom, to capture the whole page). Since there's no perfect way to translate HTML to PDF, my idea had always been to capture the IMAGE of the page (aside from capturing keywords for DB indexing which can be done separately just for 'search' support later on).

The fly in the ointment is of course the scrolling, because some apps have "infinite" scrolling, so in many SPAs there's literally no such thing as "the whole page". Anyway, I haven't tried your app yet, for not-JS and not-small reasons, but I'm just sharing my perspective on this topic. Thanks for sharing your project!


I recently released a Chrome extension that converts webpages to PDF. It's free, but you need to register to get a key. Unfortunately, this solution isn't client-side JavaScript; I'm using an API underneath. To be honest, I mainly created it to promote the API, but if it's useful to people, I might develop it further. Perhaps it could be useful to you in some way. I don't know your requirements, but with this extension as a base, it might not be difficult to add something that meets your expectations; let me know. However, if you want to export a PDF from Ahrefs, for example, I'm afraid that might not be possible; currently, only basic authentication is supported. Maybe I could add an option to pass JavaScript code, as in my API, but I doubt that would work either, since Ahrefs probably has some bot protection.

edit: i forgot the link https://chromewebstore.google.com/detail/pdfbolt-web-to-pdf/...


Thanks for sharing that. Looks pretty nice!


You can find detailed search (including title/content) via the bookmarks endpoint. Snapshot search is currently only for finding multiple snapshots of a single URL/domain. That should probably be emphasized, or content-based search should be available there as well. Thanks for the feedback!


Scraping JS-only sites is also possible without a headless browser, but it requires a bit more debugging of the internal structure of these sites. Most JS-only websites have API endpoints with JSON responses, which can make scraping more reliable than parsing custom (and sometimes invalid) HTML. The drawback of headless-browser-based scraping is that it requires a significant amount of CPU time and memory compared to "static" scraping frameworks.


Interesting idea; how do you imagine a channel-based API for this?


I would ignore the GP's advice. Channels are prone to big errors -- panics and blocking -- which aren't detectable at compile time. They make sense to use internally but shouldn't be exposed in a public API. As one example, notice how the standard library's net/http package doesn't require you to use channels, but it uses them internally.


Would this work?

  c := colly.NewCollector()

  // hypothetical: this function creates a goroutine and returns a channel
  ch := c.HTML("a")
  e := <-ch
  link := e.Attr("href")
  // ...
I'm a bit rusty (ah!) with Go, so bear with me if the above contains errors.


How do you recognize if the collector has finished? If the site doesn't contain "a" elements (e.g. because of a network error), this example would block forever.


The producer closes the channel. This is distinguishable from an open empty channel in a select.


Makes sense, thanks =)

In the above example, this would require `nil`-checking the retrieved value every time; I'm not sure it would make the API cleaner.


This would work. No callback hell, a pleasure for the eyes!


Actually, I wrote this tool to make searx's engine development easier. It's great to see that so many people find it useful. =)


Thank you for both! :)

