Only 100 years. For the whole of history before that, work happened in the vicinity of the home, so it does feel natural to return to that. Instead of anvils we hit keyboards, and instead of swords we produce alignment, but either way it puts food on the table and allows flexibility in work-life?
Not in tech, but I was a teacher for decades. My first teaching job, in the early 80s of the last century, had a requirement that teachers live within 5 miles of the building.
In general, perhaps a return to guilds? Apprentices? In an area of my city that has a lot of small craft workshops (and, yes, a few have anvils), there are 'work-live' units being built that have workshops on the ground floor and living accommodation above.
On 2/ there is a safety point in having the grounding pin on all plugs, and in it being longer than the live/neutral: on the socket side, the grounding pin opens latches that block the live/neutral holes, so kids can't stick things into them.
I would generally agree it is the best plug standard for safety, but it's clunky and painful to step on.
A little page that tries to keep up with Flickr uploads in real time, built about 13 years ago and, amazingly, still running: http://ekke.si/flickr/?render=random
I have thought of that as 'n' being the manageable threshold, with the (uncontrolled) '+1' overflow creating the problems, typically in terms of additional layers or iterations. But I like your point, and perhaps 'n+1' and '1+n' mean different problem shapes.
I could see that making sense in some contexts, like the "straw that breaks the camel's back". However, that's not what's usually being referred to in this database query problem.
Usually it's one query that returns n results, followed by one more query for each of those results, so you end up having done 1+n queries. If you'd used a join, you could potentially have done only 1 query.
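A minimal sketch of the two shapes in Python with sqlite3 (the authors/books tables are just made up for illustration):

```python
import sqlite3

# Two toy tables, only to make the query shapes concrete.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")

# The 1+n shape: one query for the parent rows...
authors = conn.execute("SELECT id, name FROM authors").fetchall()
for author_id, name in authors:
    # ...then one more query per row returned above.
    conn.execute("SELECT title FROM books WHERE author_id = ?", (author_id,)).fetchall()

# The join shape: the same data in a single round trip.
conn.execute("""
    SELECT a.name, b.title
    FROM authors a LEFT JOIN books b ON b.author_id = a.id
""").fetchall()
```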
What I'd love to see is a scraper builder that uses LLMs/'magic' to generate optimised scraping rules for any page, i.e. CSS selectors and processing rules mapped to output keys, so you can run the scraping itself at low cost and high performance.
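Something like the following is what I'm picturing: the LLM emits a small rule set once per site/template, and a cheap parser applies it on every page. The selectors and field names here are invented, and BeautifulSoup is just one possible way to apply them:

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

# Example of rules an LLM might emit for a product page (all names invented).
rules = {
    "title": {"selector": "h1.product-title", "attr": "text"},
    "price": {"selector": "span.price", "attr": "text"},
    "image": {"selector": "img.main-photo", "attr": "src"},
}

def apply_rules(html: str, rules: dict) -> dict:
    """Apply the generated selectors cheaply, with no LLM call per page."""
    soup = BeautifulSoup(html, "html.parser")
    out = {}
    for key, rule in rules.items():
        node = soup.select_one(rule["selector"])
        if node is None:
            out[key] = None
        elif rule["attr"] == "text":
            out[key] = node.get_text(strip=True)
        else:
            out[key] = node.get(rule["attr"])
    return out
```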
Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).
We currently use this at Magic Loops[2] and it works _most_ of the time.
The long-tail is difficult though, and it's not uncommon for users to back out to raw HTML, and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).
Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.
This is essentially what we're building at https://reworkd.ai (YC S23). We had thousands of users try using AgentGPT (our previous product) for scraping and we learned that using LLMs for web data extraction fundamentally does not work unless you generate code.
Code is also hard. You have to generate code that accounts for all possible exceptions or errors. If you want to automate a UI, for example, pushing a button can cause all sorts of feedback, errors, and consequences that need to be known to write the code.
Here's a project that describes using an LLM to generate crawling rules and then capturing data with them, but it looks like it's still in the early stages of research.
Most of the top LLMs already do this very well. It's because they've been trained on web data, and also because they're being used for precisely this task internally to grab data.
The complicated ops side of scraping is running headless browsers, IP ranges, bot bypass, filling captchas, observability, updating selectors, etc. There are a ton of SaaS services that do that part for you.
It also seems obvious that one would want to simply drag a box around the content you want, and have the tool then provide some examples to help you refine the rule set.
Ad blockers have had something very close to this for some time, without any sparkly AI buttons.
I'm sure someone is already working on a subscription-based model that uses corporate models in the backend, but it's something that could easily be implemented with a very small model.
That's an interesting take. I've been experimenting with reducing the overall rendered HTML down to just structure and content and then using the LLM to extract content from that. It works quite well, but I think your approach might be more efficient and faster.
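The reduction step I've been playing with is roughly this kind of thing (a toy sketch with BeautifulSoup; which tags and attributes to keep is an arbitrary choice here):

```python
from bs4 import BeautifulSoup

KEEP_ATTRS = {"href", "src", "alt"}  # arbitrary choice for illustration

def slim_html(html: str) -> str:
    """Strip scripts, styles and most attributes so the LLM mostly sees
    structure and text instead of the full rendered page."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
        tag.decompose()
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}
    return str(soup)
```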
One fun mechanism I've been using for reducing HTML size is diffing (with some leniency) pages from the same domain to exclude the common parts (i.e. headers/footers). That preprocessing can be useful for any parsing mechanism.
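Roughly along these lines, as a sketch with line-level diffing via difflib (the min_block threshold is the "leniency" knob, and its value is arbitrary):

```python
import difflib

def strip_common_parts(page_a: str, page_b: str, min_block: int = 3) -> str:
    """Drop lines that appear as shared blocks in both pages; those are
    likely headers/footers/nav rather than page-specific content."""
    a_lines = page_a.splitlines()
    b_lines = page_b.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines)
    common = set()
    for block in matcher.get_matching_blocks():
        if block.size >= min_block:  # the "leniency": ignore tiny matches
            common.update(range(block.a, block.a + block.size))
    return "\n".join(line for i, line in enumerate(a_lines) if i not in common)
```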
Parsing HTML is a solved and, frankly, not very interesting problem. Writing XPath/CSS selectors or JSON parsers (for when the data is in script variables) is not much of a challenge for anyone.
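For instance, the script-variable case is usually just a regex plus json.loads; the variable name and payload here are made up:

```python
import json
import re

# Toy example: many pages ship their state as JSON in a script variable.
html = '<script>window.__STATE__ = {"product": {"name": "Anvil", "price": 99}};</script>'

match = re.search(r"window\.__STATE__\s*=\s*(\{.*?\});", html, re.DOTALL)
if match:
    state = json.loads(match.group(1))
    print(state["product"]["name"])  # -> Anvil
```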
The more interesting issue is being able to parse data from the whole page content stack, which includes XHRs and their triggers. In that case an LLM driver would control an indistinguishable web browser to perform all the steps needed to retrieve the data as a full package. Though this is still a low-value proposition, as the models get fumbled by harder tasks and the easier tasks can be done by a human in a couple of hours.
LLM use in web scraping is still purely educational and assistive, as the biggest problem in scraping is not the parsing itself but scraper scaling and blocking, which is becoming extremely common.