1: It does expand a level of hierarchy if you already know what you're looking for (from manually getting the data). Is there a way to omit columns that have more levels or keep them as list columns?
2: Probably too niche to be worth it for you :)
3: Parsing pdfs automagically isn't easy, but handling downloads and images and storing them in a bucket would go quite far (I guess there's a way to do that, but I didn't immediately see a "simple" example).
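For context, the download-and-store flow I have in mind is roughly the following - a minimal Python sketch assuming `requests` and `boto3`, with a placeholder bucket name, nothing crul-specific:

```python
import requests
import boto3

s3 = boto3.client("s3")

def archive(url: str, bucket: str = "my-crawl-archive") -> str:
    """Download a file (e.g. a pdf or image) and store the raw bytes in S3."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    key = url.split("://", 1)[-1].replace("/", "_")  # key the object by its URL
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=resp.content,
        ContentType=resp.headers.get("Content-Type", "application/octet-stream"),
    )
    return key

# archive("https://example.com/report.pdf")
```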
It sounds like a potentially great tool; the only open question, really, is whether it's worth studying the docs and implementing a process in crul as opposed to whatever language I'm already familiar with.
We certainly have. It's admittedly quite new to both of us, so we have been exploring the best way to introduce something like this, as well as tying up technical loose ends. The open core model with commercial features is definitely appealing.
I've been involved with successful commercial projects using the open core model. Feel free to reach out over email if you have any questions. My contact info is in the profile.
I'm having trouble replicating this, although I did get a prompt from Safari to allow downloads from crul.com. In the screenshot it looks like the button is redirecting rather than triggering a download - is that what you are seeing?
This requires me to know the HTML structure in advance; I want to do it on any page. Mozilla has Readability.js, which does this, and I wanted to know if this tool has the same feature. BTW, it's a great tool.
Thank you! The query below would get you the full page text, although it likely won't be too legible. I'll read up on Readability.js some - it looks quite magical and could possibly be added in.
Often you can use crul to discover the HTML structure in the results table, with a `find "text string"` to filter rows and then a filter on the column values that identify the desired elements.
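For reference, Readability-style extraction outside of crul looks roughly like this - a sketch using the `readability-lxml` Python port rather than Mozilla's original JS, just to illustrate the idea:

```python
import requests
from readability import Document  # pip install readability-lxml

# Fetch a page and pull out the "main content" block the way Readability does,
# without knowing the HTML structure in advance.
html = requests.get("https://example.com/some-article", timeout=30).text
doc = Document(html)

print(doc.title())    # best-guess article title
print(doc.summary())  # cleaned HTML of the main content block
```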
Thanks for this, we're still trying to figure out these details ourselves.
Related to the quote, we've seen interest in API data wrangling, where prepackaged data feeds can be cumbersome to edit, or where other implementation details become challenging: credential management, domain throttling, scheduling, checkpointing, export, etc.
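As a rough illustration of the glue code involved, a hand-rolled paginated API pull with throttling and a checkpoint file might look like the sketch below - the endpoint and parameters are hypothetical, it's just to show the moving parts:

```python
import json
import time
from pathlib import Path

import requests

API_URL = "https://api.example.com/items"  # hypothetical endpoint
CHECKPOINT = Path("checkpoint.json")       # remembers the last page fetched
THROTTLE_SECONDS = 1.0                     # crude per-domain throttle

def load_checkpoint() -> int:
    return json.loads(CHECKPOINT.read_text())["page"] if CHECKPOINT.exists() else 1

def pull(token: str):
    page = load_checkpoint()
    while True:
        resp = requests.get(
            API_URL,
            params={"page": page},
            headers={"Authorization": f"Bearer {token}"},  # credential management, by hand
            timeout=30,
        )
        resp.raise_for_status()
        items = resp.json()
        if not items:
            break
        yield from items
        CHECKPOINT.write_text(json.dumps({"page": page}))  # checkpoint after each page
        page += 1
        time.sleep(THROTTLE_SECONDS)
```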
It's also interesting for webpage data when you need to use a particular page as an index of links to filter and crawl. We've tried to build an abstraction layer around that.
Initially we were mainly focused on webpages, and wanted to bypass the visuals of the browser: use a headless browser to fulfill network requests, render JS, etc., then convert the page to a flat table of enriched elements. With APIs as another data set, there is some work for us to do around language.
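In plain code, that flow (use a page as an index of links, render each with a headless browser, then flatten the result into rows of elements) looks something like the sketch below - Playwright plus BeautifulSoup here, not crul's actual internals, with a placeholder URL and filter:

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

INDEX_URL = "https://example.com/blog"  # placeholder index page of links

def flatten(html: str, source_url: str) -> list[dict]:
    """Convert a rendered page into a flat table of enriched elements."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "source": source_url,
            "tag": el.name,
            "text": el.get_text(strip=True),
            "attributes": dict(el.attrs),
        }
        for el in soup.find_all(True)
    ]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Use the index page as a source of links, then filter to the ones we want.
    page.goto(INDEX_URL)
    links = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
    links = [href for href in links if "/posts/" in href]  # placeholder filter

    rows = []
    for href in links:
        page.goto(href)  # network requests + JS rendering happen here
        rows.extend(flatten(page.content(), href))

    browser.close()

print(len(rows), "element rows")
```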
We're now trying to figure out which workflows are most relevant for crul to optimize around - and honestly, we also just built what we thought was cool. Some features/workflows will certainly be more straightforward with existing tools and software, especially for a technically savvy user.
1. Although it has limitations, were you able to try the normalize command? https://www.crul.com/docs/queryconcepts/api-normalization (a rough analogue outside crul is sketched after this list)
2. Will need to think about this some more.
3. Although we don't yet handle pdfs, the rest of the flow is one we're aiming to accomplish with crul. Other than the pdf piece, the building blocks should be there, and we'd love to understand this use case further.
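On point 1 (expanding hierarchy vs. keeping deeper levels as list columns): a rough analogue outside crul is pandas' `json_normalize`, where `max_level` controls how far nesting is expanded and anything deeper stays as an object column. This is only an illustration of the behaviour being asked about, not how crul's normalize works:

```python
import pandas as pd

records = [
    {"id": 1, "user": {"name": "a", "address": {"city": "x"}}, "tags": ["p", "q"]},
    {"id": 2, "user": {"name": "b", "address": {"city": "y"}}, "tags": ["r"]},
]

# Expand every level of nesting into dotted columns (user.address.city, ...).
print(pd.json_normalize(records))

# Expand only one level: deeper dicts (user.address) stay as object columns,
# and lists like `tags` are kept as list columns either way.
print(pd.json_normalize(records, max_level=1))
```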