1: It does expand a level of hierarchy if you already know what you're looking for (from manually getting the data). Is there a way to omit columns that have more levels or keep them as list columns?
2: Probably too niche to be worth it for you :)
3: Parsing pdfs automagically isn't easy, but handling downloads and images and storing them in a bucket would go quite far (I guess there's a way to do that, but I didn't immediately see a "simple" example).
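For context, the download-and-store flow I have in mind is roughly the following - a minimal Python sketch assuming `requests` and `boto3`, with a placeholder bucket name, nothing crul-specific:

```python
import requests
import boto3

s3 = boto3.client("s3")

def archive(url: str, bucket: str = "my-crawl-archive") -> str:
    """Download a file (e.g. a pdf or image) and store the raw bytes in S3."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    key = url.split("://", 1)[-1].replace("/", "_")  # key the object by its URL
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=resp.content,
        ContentType=resp.headers.get("Content-Type", "application/octet-stream"),
    )
    return key

# archive("https://example.com/report.pdf")
```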
It sounds like a potentially great tool; the only open question, really, is whether it's worth studying the docs and implementing a process in crul as opposed to whatever language I'm already familiar with.
We certainly have. It's admittedly quite new to both of us, so we have been exploring the best way to introduce something like this, as well as tying up technical loose ends. The open core model with commercial features is definitely appealing.
I've been involved with successful commercial projects using the open core model. Feel free to reach out over email if you have any questions. My contact info is in the profile.
I'm having trouble replicating this, although I did get a prompt from Safari to allow downloads from crul.com. In the screenshot it looks like the button is redirecting rather than triggering a download - is that what you are seeing?
This requires me to know the HTML structure in advance; I want to do it on any page. Mozilla has Readability.js, which does this, and I wanted to know if this tool has the same feature. BTW, it's a great tool.
Thank you! The query below would get you the full page text, although it likely won't be too legible. I'll read up on Readability.js some - it looks quite magical and could possibly be added in.
Often you can use crul to discover the HTML structure in the results table, with a `find "text string"` to filter rows and then a filter on the column values that identify the desired elements.
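For reference, Readability-style extraction outside of crul looks roughly like this - a sketch using the `readability-lxml` Python port rather than Mozilla's original JS, just to illustrate the idea:

```python
import requests
from readability import Document  # pip install readability-lxml

# Fetch a page and pull out the "main content" block the way Readability does,
# without knowing the HTML structure in advance.
html = requests.get("https://example.com/some-article", timeout=30).text
doc = Document(html)

print(doc.title())    # best-guess article title
print(doc.summary())  # cleaned HTML of the main content block
```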
Thanks for this, we're still trying to figure out these details ourselves.
Related to the quote, we've seen interest in API data wrangling, where prepackaged data feeds can be cumbersome to edit, or where other implementation details become challenging: credential management, domain throttling, scheduling, checkpointing, export, etc.
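As a rough illustration of the glue code involved, a hand-rolled paginated API pull with throttling and a checkpoint file might look like the sketch below - the endpoint and parameters are hypothetical, it's just to show the moving parts:

```python
import json
import time
from pathlib import Path

import requests

API_URL = "https://api.example.com/items"  # hypothetical endpoint
CHECKPOINT = Path("checkpoint.json")       # remembers the last page fetched
THROTTLE_SECONDS = 1.0                     # crude per-domain throttle

def load_checkpoint() -> int:
    return json.loads(CHECKPOINT.read_text())["page"] if CHECKPOINT.exists() else 1

def pull(token: str):
    page = load_checkpoint()
    while True:
        resp = requests.get(
            API_URL,
            params={"page": page},
            headers={"Authorization": f"Bearer {token}"},  # credential management, by hand
            timeout=30,
        )
        resp.raise_for_status()
        items = resp.json()
        if not items:
            break
        yield from items
        CHECKPOINT.write_text(json.dumps({"page": page}))  # checkpoint after each page
        page += 1
        time.sleep(THROTTLE_SECONDS)
```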
It's also interesting for webpage data when you need to use a particular page as an index of links to filter and crawl. We've tried to build an abstraction layer around that.
Initially we were mainly focused on webpages, and wanted to bypass the visuals of the browser: use a headless browser to fulfill network requests, render JS, etc., then convert the page to a flat table of enriched elements. With APIs as another data set, there is some work for us to do around language.
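In plain code, that flow (use a page as an index of links, render each with a headless browser, then flatten the result into rows of elements) looks something like the sketch below - Playwright plus BeautifulSoup here, not crul's actual internals, with a placeholder URL and filter:

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

INDEX_URL = "https://example.com/blog"  # placeholder index page of links

def flatten(html: str, source_url: str) -> list[dict]:
    """Convert a rendered page into a flat table of enriched elements."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "source": source_url,
            "tag": el.name,
            "text": el.get_text(strip=True),
            "attributes": dict(el.attrs),
        }
        for el in soup.find_all(True)
    ]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Use the index page as a source of links, then filter to the ones we want.
    page.goto(INDEX_URL)
    links = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
    links = [href for href in links if "/posts/" in href]  # placeholder filter

    rows = []
    for href in links:
        page.goto(href)  # network requests + JS rendering happen here
        rows.extend(flatten(page.content(), href))

    browser.close()

print(len(rows), "element rows")
```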
We're now trying to figure out which workflows are most relevant for crul to optimize around - and honestly, we also just built what we thought was cool. Some features/workflows will certainly be more straightforward with existing tools and software, especially for a technically savvy user.
1. Although it has limitations, were you able to try the normalize command? https://www.crul.com/docs/queryconcepts/api-normalization (a rough analogue outside crul is sketched after this list)
2. Will need to think about this some more.
3. Although we don't yet handle pdfs, the rest of the flow is one we're aiming to accomplish with crul. Other than the pdf piece, the building blocks should be there, and we'd love to understand this use case further.
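On point 1 (expanding hierarchy vs. keeping deeper levels as list columns): a rough analogue outside crul is pandas' `json_normalize`, where `max_level` controls how far nesting is expanded and anything deeper stays as an object column. This is only an illustration of the behaviour being asked about, not how crul's normalize works:

```python
import pandas as pd

records = [
    {"id": 1, "user": {"name": "a", "address": {"city": "x"}}, "tags": ["p", "q"]},
    {"id": 2, "user": {"name": "b", "address": {"city": "y"}}, "tags": ["r"]},
]

# Expand every level of nesting into dotted columns (user.address.city, ...).
print(pd.json_normalize(records))

# Expand only one level: deeper dicts (user.address) stay as object columns,
# and lists like `tags` are kept as list columns either way.
print(pd.json_normalize(records, max_level=1))
```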