Show HN: CLI for generating PDFs for offline reading (github.com/dvcoolarun)
161 points by dvcoolarun on Feb 5, 2024 | 43 comments
I've always thought that extensive reading was best suited for the realm of paper. As a result, I've created a command-line interface (CLI) tailored for my own use and decided to make it open source. I welcome any feedback you may have.

[Edit] Sample PDF :: https://drive.google.com/file/d/1n7M1TKOptSsYiibrbvV_Yojx53T...



I feel like if you are claiming "beautiful" output then it's obligatory to have at the very least screenshots of said output PDFs (or better yet, a sample for the same link in the CLI screenshot, especially so people can see how the text flows, what quality images are captured at, how text can be selected, etc).


Arr! Just updated the post with a sample PDF.


I'd strongly suggest including the sample PDF screenshot in the GitHub readme as well.

Also just my opinion and honest feedback but I don't really consider that PDF "beautiful" or readable. However, your project is a very good start toward something that is potentially very useful.

Here are my suggestions:

- Switch to a modern, beautiful, readable serif font that isn't overused (Times New Roman, Garamond, and pretty much any MS fonts are overused and will make your PDF end up looking like "yet another Word document"). Some good options that come to mind are Tiempos Fine, Butler, Lora, but there are lots of good ones out there. Shop around for something both unique AND highly readable. Make sure your PDF generator embeds the font or (preferably) converts all text to paths instead of just referencing the font.

Also, know the difference between fonts designed for printing and screen reading and fonts designed for both, and adjust according to whatever use case you're marketing.

- Figure out how to make the sub-headers heavier weight and significantly larger to stand out. Be bold with your font size changes. Printed documents look "nice" when the sections are clearly laid out and EASY for the eye to parse from a distance how the document is organized and which are top level headings and which are subheadings under those headings. For generation purposes there's a good chance you can probably just parse it from <h1> <h2> <h3> ... tags in the HTML, unless the website uses some JS-framework-of-the-month bullshit. Use more white space before each header to physically separate it. Avoid using lines between a header and the text it corresponds to, as that can be distracting to the eye since lines are normally thought of as separating sections.

I'd give you a pull request but I'm super busy these days.
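
For anyone who wants to try the heading suggestion above, here is a minimal sketch of pulling the <h1>/<h2>/<h3> outline out of a page with only the Python standard library. The names HeadingOutline and extract_outline are made up for illustration and are not part of OP's project; the idea is that the resulting levels could drive font size and spacing in whatever CSS the PDF step uses.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3"}


class HeadingOutline(HTMLParser):
    """Collect (level, text) pairs for every <h1>..<h3> in a document."""

    def __init__(self):
        super().__init__()
        self.outline = []    # list of (level, heading text) in document order
        self._level = None   # level of the heading currently open, if any
        self._chunks = []    # text fragments seen inside the open heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a heading (including nested tags).
        if self._level is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            # Collapse runs of whitespace so wrapped headings become one line.
            text = " ".join("".join(self._chunks).split())
            self.outline.append((self._level, text))
            self._level = None


def extract_outline(html):
    """Return the heading outline of an HTML string (hypothetical helper)."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return parser.outline
```

This only reads static markup, so pages that build their headings with JavaScript would need to be rendered first.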


Just one thing to add: Reduce the line length and make the font slightly bigger.

Maybe increase the line height a bit.


Is it just me or does the sample not look that beautiful?


Kinda looks like the script just grabbed the <p>'s and did a copy paste. Without much thought to format


Correct, it is only you who looks that beautiful


Yeah, that would make this post a lot better.


Agreed. I would offer a PDF version of the project's readme file as a demo.

OP: this is important. There are a million tools to generate PDF files, most of them don't produce nice looking PDFs.


This is cool! I have an HN pipeline where I upvote things that I want to drill into, and a script I wrote generates PDFs and sends them to my Kindle for offline reading (great for my pipeline). That uses Playwright's "to PDF" method, which goes through the browser and is slow. I might look into replacing it with this.

If there's any interest I might OSS the pipeline


PDF sounds like a terrible format for reading on the kindle, why not MOBI or epub or whatever it uses?


I wonder if Pandoc could do this well.

Edit: Ah, there was a thread(1) with related examples further down. And Pandoc can indeed do epub output.

1 - https://news.ycombinator.com/item?id=39268147


I was thinking the same thing. Why wouldn't this project convert to epub which is easier to pack and much easier to read?


This sounds like an excellent idea!


Could you share more details about this? The tools / flow?


We just use a headless chrome with a sort of wrapper script to do this at my work with a bunch of settings close to the actual size of paper. It allows me to test all of our reports in media->print in dev tools then print->pdf with chrome and only have to design to that spec. Then in our reports we provide a "save as pdf" button instead of encouraging print in all the other possible browsers which would make the task insane and cause me to possibly quit.
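
A wrapper like the one described above could look roughly like this in Python. This is a sketch, not the commenter's actual script: the function names are invented, and the binary name "google-chrome" is an assumption that varies by platform (it may be "chromium" or a full path). The --headless and --print-to-pdf switches are Chrome's own; paper size is left to the page's print CSS.

```python
import subprocess


def chrome_pdf_command(url, output_path, chrome_binary="google-chrome"):
    """Build the argv list for a headless Chrome print-to-PDF run.

    chrome_binary is an assumption; adjust it to whatever Chrome or
    Chromium executable exists on the machine.
    """
    return [
        chrome_binary,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={output_path}",
        url,
    ]


def print_to_pdf(url, output_path, chrome_binary="google-chrome"):
    """Run Chrome and raise CalledProcessError if it exits non-zero."""
    subprocess.run(chrome_pdf_command(url, output_path, chrome_binary), check=True)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.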


Apologies for the oversight; I forgot to include the screenshot of the sample PDF. Here it is for your reference: https://drive.google.com/file/d/1n7M1TKOptSsYiibrbvV_Yojx53T...


What was the website source?



Arr, this blew up! I think people are missing some of the context of the script. It's a plug-and-play script where you can make changes to PDF quality using CSS/Python. Even fonts are loaded through Google in Python. 'Beautiful' is contextual. You can create your own version and share it with the community.

I'm on mobile, so I can't add a Google Drive file screenshot to the readme, and iframes are not supported.


like this:

  sudo apt install pandoc wkhtmltopdf
  npm install -g readability-cli
  pandoc -s https://www.paulgraham.com/avg.html -o output.html && readable output.html -o readable.html && wkhtmltopdf readable.html output.pdf &&  open output.pdf
Going even further, using a bash script to prompt for the URL:

  #!/bin/bash

  # Prompt the user for a URL
  read -p "Enter the URL: " URL

  # Use the URL in the pandoc command
  pandoc -s "$URL" -o output.html && readable output.html -o readable.html && wkhtmltopdf readable.html output.pdf && open output.pdf


  chmod +x web2pdf.sh
  # add an alias to bashrc
  alias web2pdf='/path/to/your/web2pdf.sh'
  source ~/.bashrc
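
The same three-step pipeline can also be driven from Python, which sidesteps shell variable and quoting mistakes. This is a rough sketch: pipeline_commands and web_to_pdf are made-up names, the commands and flags are copied from the one-liner above, and it assumes pandoc, readable, and wkhtmltopdf are already on PATH.

```python
import subprocess


def pipeline_commands(url):
    """Return the three commands from the shell pipeline as argv lists."""
    return [
        ["pandoc", "-s", url, "-o", "output.html"],
        ["readable", "output.html", "-o", "readable.html"],
        ["wkhtmltopdf", "readable.html", "output.pdf"],
    ]


def web_to_pdf(url):
    """Run each step in order, stopping at the first failure (like &&)."""
    for command in pipeline_commands(url):
        subprocess.run(command, check=True)
```

Calling web_to_pdf("https://www.paulgraham.com/avg.html") would leave output.pdf in the current directory if all three tools succeed.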


Or you could just import weasyprint and readability in python!

Which is pretty much what OP did, with a few additions like nicer output and some custom CSS and fonts. That's nice, and OP hopes it brings value to people.

If you add a few printf:s and include a nice template for pandoc* it would be more along the same thing.

*) Neither pandoc, wkhtmltopdf, nor weasyprint has (in my opinion) a bad default template.


Pandoc can also output to PDF if you have LaTeX installed. You can use tectonic if you want a single-binary solution rather than a full LaTeX toolchain; it downloads the necessary packages on the fly.


Pandoc can also do epub output: https://pandoc.org/epub.html


wkhtmltopdf seems to be archived on github[1]. Can't say I want to run an unpatched, outdated browser engine for this.

[1]: https://github.com/wkhtmltopdf/wkhtmltopdf


Very interesting! One piece of feedback: it would probably be more useful to have a screenshot of the PDF on your README rather than one of the CLI. Also, do you intend to release this as FOSS?


Both Chrome and Firefox have absolutely horrible "Print" (to PDF) commands, which render the Web pages in a different way than what they show on the screen, and which results in large parts of the page being obscured by ads, menus, headers, etc., or in parts of the Web page that are outside the rendered area, so they are missing, or in content that is compressed to a small part of the output pages.

It would be really nice if there existed a utility able to produce a PDF file where the Web pages are rendered as well as the browsers render them on the screen, without becoming confused even by complex scripts loaded by the page.

The alternatives to "Print" (producing a PDF) are even worse. A screenshot has limited resolution and it loses the text. In the past "Save as ..." was the normal solution, but now even if you save a "complete" page, it will still frequently include scripts that will no longer work offline. What I want to save are the pages perfectly rendered as they were at that instant, without any scripts that could make them appear differently in the future.


You should try SingleFile, see https://github.com/gildas-lormeau/SingleFile


FTA: “Then you can use the tool as follows

  pipenv shell
  pipenv install
  python main.py https://www.paulgraham.com/avg.html, https://www.paulgraham.com/determination.html
Just add the webpage URLs separated by commas”

What’s the rationale for “separated by commas”? The convention for CLI arguments is to use one argument per input file.


Also, commas are valid characters in a URL, so they should not be used as a separator for URLs.

If you want to support multiple input URLs just pass all of them in, one URL per element in sys.argv
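
A minimal sketch of what that could look like with argparse from the standard library. The function name parse_urls and the description string are invented here; this is not how OP's main.py is currently written.

```python
import argparse


def parse_urls(argv=None):
    """Parse one or more URLs, one per command-line argument.

    With argv=None argparse reads sys.argv[1:], so the same function
    works both from the command line and from tests.
    """
    parser = argparse.ArgumentParser(description="Convert web pages to PDF.")
    parser.add_argument("urls", nargs="+", metavar="URL",
                        help="one or more page URLs, separated by spaces")
    return parser.parse_args(argv).urls


# Why splitting on commas is fragile: a comma inside a URL is cut in two.
broken = "https://example.com/a,b".split(",")
```

Usage would then be python main.py URL1 URL2, and a URL containing a comma arrives intact as a single argument.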


  % python main.py https://www.paulgraham.com/avg.html
  Traceback (most recent call last):
    File "/Users/bill/web2pdf/main.py", line 7, in <module>
      from readability import Document
  ImportError: cannot import name 'Document' from 'readability' (/Users/bill/.local/share/virtualenvs/web2pdf-gXeVRXKg/lib/python3.9/site-packages/readability/__init__.py)
But according to your Pipfile.lock, the readability module needed is 0.3.1:

  "readability": {
      "hashes": [
          "sha256:f9030df8bc31aad45baffa9a2d9ce1fdd8051833e5b5bda3027df32fdec00fad"
      ],
      "index": "pypi",
      "version": "==0.3.1"
  },
Version 0.3.1 of the module "readability" exists, but does not appear to have a class "Document".


I think the Pipfile should specify only the ‘readability-lxml’ package and not ‘readability’, which does unrelated things.
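
Both packages install a top-level module called readability, which is why the import fails with a confusing message instead of "module not found". A quick way to check which one an environment actually has is to test for the Document class that readability-lxml provides. The helper name has_document_class is made up for this sketch.

```python
import importlib


def has_document_class(module_name="readability"):
    """Return True if the named module imports and exposes `Document`.

    readability-lxml is used as `from readability import Document`;
    the unrelated `readability` package has no such class, so this
    returns False when the wrong distribution is installed.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, "Document")
```

If has_document_class() returns False inside the project's virtualenv, the Pipfile is most likely pulling in the wrong distribution.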


Apropos of nothing, I added this function so I don't have to leave the command line to see the PDF.

   pdfpage() {
     convert -resize 0x1000^ "${1}"[${2}] -background white -flatten sixel:-
   }
You can probably deduce it assumes you have ImageMagick installed and you're in a terminal with sixel support.


Somewhat similarly, I wrote a web app to generate epub (instead of pdf) out of urls and send it to e-ink reader(s) directly (via a telegram bot) so I can read them. Currently it supports sending epub by email (for kindle) or uploading epub to dropbox (for kobo, etc.). It originally also supported reMarkable cloud, but we can no longer make reMarkable cloud actually work. There's also a REST api to generate epub to be downloaded directly: https://github.com/fishy/url2epub/blob/main/REST.md

For e-ink readers epubs are generally better than PDFs for urls anyways, as epubs are basically packed htmls, and also the flow text works better on smaller screens.


Perhaps add uBlock filter support? I use it to strip out any unwanted elements on a page before printing. On Hacker News discussions it removes forms, reply links, headers and footers...


For print or PDF, I like multi-column newspaper style, as created by this extension: https://chromewebstore.google.com/detail/simple-print/nalmbm...

One benefit of using a Chrome extension (vs. CLI) is that it's easy to 'print' things that require authentication.


Have you compared it with a conversion by pandoc (https://pandoc.org/)?


Came for this. Whole thing would look nicer by just using markdownload -> pandoc. This is a solved problem and the existing solution seems more elegant. It’s cool to do projects tho.


Does it run a headless chrome for pixel-perfect formatting, laid out as the webpage and carried over in that form to PDF, ignoring the page's print CSS rules? Cuz that would be a nice start. And an option for size to be pixel-width based for ideal layout... Because I won't be printing, I will be viewing on my phone, so one overly large page would be perfect.


Webbrowser opens url -> print -> save as/to pdf?

I'm sure I'm missing something, what is a cli interface buying me here?


Very cool! In README.md, is that an extra 'p' in Webp2pdf?


Can you add comparison pdfs generated by pandoc and gotenberg?


Found some potential bugs. Please check the github issues page.



