Please share some examples of webpages with data to be wrangled that support thi...

portInit · on Feb 28, 2023

Thanks for this, we're still trying to figure out these details ourselves.

Related to the quote, we've seen interest in API data wrangling, where prepackaged data feeds can be cumbersome to edit, or where other implementation details become challenging, such as credential management, domain throttling, scheduling, checkpointing, export, etc.

It's also interesting for webpage data when you need to use a particular page as an index of links to filter and crawl. We've tried to build an abstraction layer around that.
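To make the "index of links" pattern concrete, here is a minimal plain-Python sketch. It is not crul's query language or implementation; the class name, function name, sample HTML and URL pattern are all made up for illustration. It collects the links on an index page, resolves them against a base URL, and keeps only those matching a filter, in page order and without duplicates.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the index page URL.
                self.links.append(urljoin(self.base_url, href))


def links_to_crawl(index_html, base_url, pattern):
    """Return de-duplicated links matching `pattern`, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(index_html)
    collector.close()
    seen = set()
    selected = []
    for link in collector.links:
        if re.search(pattern, link) and link not in seen:
            seen.add(link)
            selected.append(link)
    return selected


# Hypothetical index page; a real one would be fetched first.
index_html = (
    '<a href="/posts/1">One</a>'
    '<a href="/about">About</a>'
    '<a href="/posts/2">Two</a>'
    '<a href="/posts/1">One again</a>'
)
print(links_to_crawl(index_html, "https://example.com/", r"/posts/\d+$"))
```

The returned list is what a crawler would then visit one by one; throttling, scheduling and checkpointing are left out of the sketch.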

Initially, we were mainly focused on webpages and wanted to bypass the visuals of the browser, using a headless browser to fulfill network requests, render JS, etc., and then convert the page to a flat table of enriched elements. With the APIs as another data set, there is some work for us to do around language.

We're now trying to figure out which workflows are most relevant for crul to optimize around. Honestly, we just built what we thought was cool. Some features/workflows will certainly be more straightforward with existing tools and software, especially for a technically savvy user.

nine_k · on Feb 28, 2023

Do you have an easy way to transform e.g. a typical Amazon product page into nicely structured data? Not that it's trying to be very video-gamey.

1vuio0pswjnm7 · on March 1, 2023

What should the structure look like? If it is CSV, what are the columns, i.e., what specific data does it need to include?

Taking a quick look at the Amazon site these product pages appear to be enormous in size. Interestingly, the website requires a "viewport-width" header. Otherwise one gets directed to a CAPTCHA.
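For illustration, a request carrying that header could be prepared as in the sketch below. The URL and the width value "1280" are placeholders I made up, and the function name is hypothetical; the sketch only builds the request object and does not send it. Whether this header alone is enough to avoid the CAPTCHA at any given time is an assumption.

```python
from urllib.request import Request


def build_request(url, viewport_width="1280"):
    """Prepare (but do not send) a GET request with a viewport-width header."""
    # The value "1280" is an arbitrary placeholder, not a known requirement.
    return Request(url, headers={"viewport-width": viewport_width})


# Placeholder URL; substitute a real product page.
req = build_request("https://www.example.com/dp/XXXXXXXXXX")

# urllib normalizes stored header names with str.capitalize().
print(req.get_header("Viewport-width"))
```

Passing `req` to `urllib.request.urlopen` would perform the actual fetch.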

The product page I checked already has some structured data in the form of JSON, including keys such as

   "title":"xxxxxxxxxx"
   "displayPrice":"$000.00"
   "priceAmount":000.00
   "currencySymbol":"$"
   "integerValue":"000"
   "decimalSeparator":"."
   "fractionalValue":"00"
   "symbolPosition":"left"
   "asin": "xxxxxxxxx"
   "asin":"xxxxxxxxxx"
   "acAsin":"xxxxxxxxxx"
   "buyingOptionTypes":["NEW"]
   "productAsin":"xxxxxxxxxx"
   "mediaAsin":"xxxxxxxxxxx"
   "parentAsin":"xxxxxxxxx"
   "asinList":"xxxxxxxxxx"

Thus, CSV with product name, price and ASIN would appear to be easy. No need to mess with the HTML.
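A rough sketch of that idea in Python, assuming the keys appear in the raw page text as simple "key":"value" string pairs like the ones listed above. The function names and the sample text are made up; a real page may repeat a key with different values or escape quotes inside a value, which this pattern does not handle.

```python
import csv
import io
import re

# Keys taken from the list above.
FIELDS = ["title", "displayPrice", "asin"]


def extract_fields(page_text, fields=FIELDS):
    """Return the first string value found for each key, or "" if absent."""
    row = {}
    for key in fields:
        # Matches "key":"value", allowing whitespace around the colon.
        pattern = r'"%s"\s*:\s*"([^"]*)"' % re.escape(key)
        match = re.search(pattern, page_text)
        row[key] = match.group(1) if match else ""
    return row


def to_csv(rows, fields=FIELDS):
    """Serialize a list of row dicts to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample standing in for the text of a product page.
sample = (
    '<script>{"title":"Example Widget","displayPrice":"$12.34",'
    '"asin": "B000000000"}</script>'
)
print(to_csv([extract_fields(sample)]))
```

Searching the raw text avoids having to locate and parse each embedded JSON blob, at the cost of taking whichever occurrence of a key comes first.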

Other data, e.g., delivery time, seller, where the item ships from and number left in stock, can be extracted from the HTML.

Delivery time is in a <span> that contains "data-csa-c-delivery-time".

Seller, shipping info and number left in stock are under a <span> with class="a-size-base _p13n-desktop-sims-fbt_fbt-desktop_shipping-info-show-box__17yWM"
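As a sketch of how those two spans could be pulled out with only the Python standard library: the attribute name and class name are the ones quoted above, while the helper names and the sample HTML are invented for illustration. Generated class names like this one tend to change, so treat the selector as an assumption about the page at the time of writing.

```python
from html.parser import HTMLParser

DELIVERY_ATTR = "data-csa-c-delivery-time"
SHIPPING_CLASS = "_p13n-desktop-sims-fbt_fbt-desktop_shipping-info-show-box__17yWM"


class SpanText(HTMLParser):
    """Collect the text of <span> elements whose attributes satisfy `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted  # function: dict of attributes -> bool
        self.depth = 0        # > 0 while inside a matching span
        self.parts = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1   # nested span inside a match
        elif self.wanted(dict(attrs)):
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Join the collected text and collapse runs of whitespace.
                self.results.append(" ".join("".join(self.parts).split()))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def span_texts(html, wanted):
    parser = SpanText(wanted)
    parser.feed(html)
    parser.close()
    return parser.results


def has_delivery(attrs):
    return DELIVERY_ATTR in attrs


def has_shipping_class(attrs):
    return SHIPPING_CLASS in (attrs.get("class") or "").split()


# Invented sample markup, not copied from a real page.
sample = (
    '<span data-csa-c-delivery-time="Friday, March 3">'
    "FREE delivery <b>Friday, March 3</b></span>"
    '<span class="a-size-base ' + SHIPPING_CLASS + '">'
    "Ships from <span>Example Seller</span></span>"
)
print(span_texts(sample, has_delivery))
print(span_texts(sample, has_shipping_class))
```

Each call returns one string per matching span, including the text of any nested elements.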

One needs to decide what data one wants from the page.

The way to present an example on which to evaluate a "new" solution such as the one in this thread is to present a problem, e.g.,

Get data items x, y and z from website xyzexample.com.

In the majority of cases I see submitted to HN, it is impossible to benchmark these "new" solutions against existing ones because no example websites are ever provided.

nine_k · on March 1, 2023

The output may be a collection of CSV files, or a JSON file with nicely structured data, because the page certainly has a pretty visible structure, with various data blocks.