
Do you have an easy way to transform e.g. a typical Amazon product page into nicely structured data?


What should the structure look like? If it is CSV, what are the columns, i.e., what specific data does it need to include?

Taking a quick look at the Amazon site these product pages appear to be enormous in size. Interestingly, the website requires a "viewport-width" header. Otherwise one gets directed to a CAPTCHA.
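As a rough illustration of the header point, here is a minimal sketch using only the Python standard library. The function names and the width value of 1920 are my own assumptions, and I have not verified that this header alone is enough to avoid the CAPTCHA in every case.

```python
import urllib.request


def build_request(url: str, viewport_width: int = 1920) -> urllib.request.Request:
    """Build a GET request carrying a viewport-width header.

    The default width of 1920 is an arbitrary assumption.
    """
    headers = {"viewport-width": str(viewport_width)}
    return urllib.request.Request(url, headers=headers)


def fetch(url: str) -> str:
    """Fetch a page and return its body as text (performs a network call)."""
    with urllib.request.urlopen(build_request(url)) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Usage (not run here because it hits the network):
# html = fetch("https://www.amazon.com/dp/XXXXXXXXXX")
```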

The product page I checked already has some structured data in the form of JSON, including keys such as

   "title":"xxxxxxxxxx"
   "displayPrice":"$000.00"
   "priceAmount":000.00
   "currencySymbol":"$"
   "integerValue":"000"
   "decimalSeparator":"."
   "fractionalValue":"00"
   "symbolPosition":"left"
   "asin": "xxxxxxxxx"
   "asin":"xxxxxxxxxx"
   "acAsin":"xxxxxxxxxx"
   "buyingOptionTypes":["NEW"]
   "productAsin":"xxxxxxxxxx"
   "mediaAsin":"xxxxxxxxxxx"
   "parentAsin":"xxxxxxxxx"
   "asinList":"xxxxxxxxxx"
Thus, CSV with product name, price and ASIN would appear to be easy. No need to mess with the HTML.
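A minimal sketch of that idea, assuming the keys appear in the page source exactly as listed above and that the first occurrence of each key is the one wanted. The helper names and the sample string are made up for illustration; a real page may need a proper JSON parse of the enclosing object rather than regular expressions.

```python
import csv
import io
import re


def first_match(pattern: str, text: str) -> str:
    """Return the first capture group of pattern in text, or "" if absent."""
    m = re.search(pattern, text)
    return m.group(1) if m else ""


def product_row(page: str) -> dict:
    """Pull title, price and ASIN out of the JSON embedded in the page source."""
    return {
        # The title is returned still JSON-escaped (e.g. \" stays as-is).
        "title": first_match(r'"title"\s*:\s*"((?:[^"\\]|\\.)*)"', page),
        "price": first_match(r'"priceAmount"\s*:\s*([0-9.]+)', page),
        "asin": first_match(r'"asin"\s*:\s*"([^"]*)"', page),
    }


def to_csv(rows: list) -> str:
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["title", "price", "asin"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical fragment standing in for the real page source.
sample = (
    '{"title":"Example Widget","displayPrice":"$12.50",'
    '"priceAmount":12.50,"asin": "B000000000"}'
)
print(to_csv([product_row(sample)]))
```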

Other data such as, e.g., delivery time, seller, where the item ships from and number left in stock can be extracted from the HTML.

Delivery time is in a <span> that contains "data-csa-c-delivery-time".

Seller, shipping info and number left in stock are under a <span> with class="a-size-base _p13n-desktop-sims-fbt_fbt-desktop_shipping-info-show-box__17yWM"
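Both lookups can be sketched with the standard library's html.parser. This assumes the delivery time is the value of the "data-csa-c-delivery-time" attribute and that the shipping details are the text inside the span with the class above. The class name looks generated and may well change, and the sample HTML below is invented.

```python
from html.parser import HTMLParser

# Class name as observed on the page; likely to change over time.
SHIPPING_CLASS = "_p13n-desktop-sims-fbt_fbt-desktop_shipping-info-show-box__17yWM"


class SpanExtractor(HTMLParser):
    """Collect delivery-time attribute values and shipping-info span text."""

    def __init__(self):
        super().__init__()
        self.delivery_times = []
        self.shipping_info = []
        self._depth = 0    # > 0 while inside a shipping-info span
        self._chunks = []  # text pieces collected inside that span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr = dict(attrs)
        if "data-csa-c-delivery-time" in attr:
            # Assumption: the attribute value itself is the delivery time.
            self.delivery_times.append(attr["data-csa-c-delivery-time"] or "")
        if self._depth:
            self._depth += 1  # nested span inside the shipping-info span
        elif SHIPPING_CLASS in (attr.get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._chunks).split())
                self.shipping_info.append(text)

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Invented fragment for illustration only.
sample_html = (
    '<span data-csa-c-delivery-time="Tuesday, March 5">FREE delivery</span>'
    '<span class="a-size-base ' + SHIPPING_CLASS + '">'
    "Ships from <span>Example Seller</span></span>"
)
parser = SpanExtractor()
parser.feed(sample_html)
print(parser.delivery_times, parser.shipping_info)
```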

One needs to decide what data one wants from the page.

The way to present an example on which to evaluate a "new" solution such as the one in this thread is to present a problem, e.g.,

Get data items x, y and z from website xyzexample.com.

In the majority of cases I see submitted to HN, it is impossible to benchmark these "new" solutions against existing ones because no example websites are ever provided.


The output may be a collection of CSV files, or a JSON file with nicely structured data, because the page certainly has a pretty visible structure, with various data blocks.



