Author of HumanifyJS here! I've built an LLM-based tool specifically for this. It applies LLMs at the AST level to guarantee that the code keeps working after the unminification step.
Would it be difficult to add a 'rename from scratch' feature? I mean a feature that takes normal code (as opposed to minified code) and (1) scrubs all the user's meaningful names, (2) chooses names based on the algorithm and remaining names (ie: the built-in names).
Sometimes when I refactor, I do this manually with an LLM. It is useful in at least two ways: it can reveal better (more canonical) terminology for names (eg: 'antiparallel_line' instead of 'parallel_line_opposite_direction'), and it can also reveal names that could be generalized (eg: 'find_instance_in_list' instead of 'find_animal_instance_in_animals').
What kind of question does it ask the LLM? Giving it a whole function and asking "What should we rename <variable 1>?" repeatedly until everything has been renamed?
Or asking it to do the whole thing at once, then parsing the output and checking that the AST still matches?
Does it work with huge files? I'm talking about something like 50k lines.
Edit: I'm currently trying it with a mere 1.2k JS file (openai mode), and it's only 70% done after 20 minutes. Even if it theoretically works with a 50k LOC file, I don't think you should try.
It does work with files of any size, although it is quite slow if you're using the OpenAI API. HumanifyJS processes each variable name separately, which keeps the context size manageable for an LLM.
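To make the "one variable at a time, with bounded context" idea concrete, here is a toy sketch. It is not HumanifyJS's actual code: `contextWindow`, `renameOneByOne` and the `suggestName` callback (a stand-in for the LLM call) are hypothetical names, and it uses plain word-boundary text replacement where a real tool would do a scope-aware rename through the AST.

```javascript
// Toy sketch: rename identifiers one at a time, showing the naming function
// only a bounded window of surrounding code. All names are hypothetical.

// Return up to `radius` characters of code on each side of `index`.
function contextWindow(source, index, radius) {
  const start = Math.max(0, index - radius);
  const end = Math.min(source.length, index + radius);
  return source.slice(start, end);
}

// Rename each identifier in `names`, one per step. `suggestName` stands in
// for the LLM call: it receives (oldName, context) and returns a new name.
// ASSUMPTIONS: names are plain alphanumeric identifiers (no `$`), and
// word-boundary replacement is good enough. It is NOT scope-aware, so
// shadowed variables and property names would be renamed incorrectly.
function renameOneByOne(source, names, suggestName, radius = 200) {
  let code = source;
  for (const oldName of names) {
    const pattern = new RegExp(`\\b${oldName}\\b`, "g");
    const first = code.search(pattern);
    if (first === -1) continue; // identifier not present, nothing to do
    const newName = suggestName(oldName, contextWindow(code, first, radius));
    // A replacer function avoids special handling of `$` in the new name.
    code = code.replace(pattern, () => newName);
  }
  return code;
}
```

The point of the window is that prompt size depends on `radius`, not on the size of the file, which is why file length affects run time rather than whether the model can cope.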
I'm currently working on parallelizing the rename process, which should give orders of magnitude faster processing times for large files.
> Large files may take some time to process and use a lot of tokens if you use ChatGPT. For a rough estimate, the tool takes about 2 tokens per character to process a file:
> echo "$((2 * $(wc -c < yourscript.min.js)))"
> So for reference: a minified bootstrap.min.js would take about $0.5 to un-minify using ChatGPT.
> Using humanify local is of course free, but may take more time, be less accurate and may not be possible with your existing hardware.
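The rule of thumb quoted above is easy to wrap in a couple of functions if you want the estimate inside a script. This is only a sketch: `estimateTokens` and `estimateCostUsd` are made-up names, and the price per million tokens is an input you have to look up yourself, since the quote does not state one.

```javascript
// Rough token estimate using the quoted rule of thumb: ~2 tokens per character.
// `wc -c` counts bytes, so byte length is used here to match the shell command.
function estimateTokens(source) {
  return 2 * Buffer.byteLength(source, "utf8");
}

// Convert a token estimate to dollars. `usdPerMillionTokens` is an assumed
// input: check the current pricing for whichever model you use.
function estimateCostUsd(tokens, usdPerMillionTokens) {
  return (tokens / 1_000_000) * usdPerMillionTokens;
}
```

For a file on disk you would pass in something like `fs.readFileSync("yourscript.min.js", "utf8")`.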
It uses smart feedback to fix the code when the LLM occasionally hiccups. You could also have a "supervisor LLM" that asserts that the resulting code matches the specification, and gives feedback if it doesn't.
It's a shame this loses one of the most useful aspects of LLM un-minifying - making sure it's actually how a person would write it. E.g. GPT-4o directly gives the exact same code (+contextual comments) with the exception of writing the for loop in the example in a natural way:
for (var index = 0; index < inputLength; index += chunkSize) {
Comparing the ASTs is useful though. Perhaps there's a way to combine the approaches: have the LLM convert, compare the ASTs, have the LLM explain the practical differences (if any) in the context of the actual implementation, and give it a chance to make any changes "more correct". Still not guaranteed to be perfect, but the resulting code would be significantly more "natural".
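A rough sketch of what such a convert/compare/feedback loop could look like. Everything here is hypothetical: `rewrite` stands in for the LLM call and `describeDifference` for whatever AST comparison you trust. This version is stricter than the idea above, since it only accepts a candidate when the comparison reports no difference at all. It is written synchronously to keep it short; real LLM calls would be async.

```javascript
// Sketch of a "convert, compare, give feedback" loop. Both callbacks are
// hypothetical stand-ins:
//   rewrite(code, feedback) -> candidate code (e.g. an LLM call)
//   describeDifference(original, candidate) -> null if the two are
//     considered equivalent, otherwise a string describing the mismatch
function rewriteWithFeedback(original, rewrite, describeDifference, maxAttempts = 3) {
  let feedback = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = rewrite(original, feedback);
    const difference = describeDifference(original, candidate);
    if (difference === null) {
      return { code: candidate, attempts: attempt, verified: true };
    }
    feedback = difference; // handed back to `rewrite` on the next attempt
  }
  // Nothing passed the check: fall back to the unmodified input.
  return { code: original, attempts: maxAttempts, verified: false };
}
```

Falling back to the original on failure means the worst case is "no improvement" rather than silently broken code.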
As someone who has spent countless hours and days deobfuscating malicious JavaScript by hand (manually and with some scripts I wrote), your tool is really, really impressive. I'm running it locally on a high-end system with an RTX 4090 and it's great. Good work :)
how do you make an LLM work on the AST level? do you just feed a normal LLM a text representation of the AST, or do you make an LLM where the basic data structure is an AST node rather than a character string (human-language word)?
The frontier models can all work with both source code and ASTs as a result of their standard training.
Knowing this raises the question: which is better to feed an LLM, source code or ASTs?
The answer is that it really depends on the use case; there are tradeoffs. For example, keeping comments intact may give the model hints that help it reason better. On the other hand, it can be argued that a pure AST has less noise for the model to be confused by.
There are other tradeoffs as well. For example, any analysis relating to coding styles would require the full source code.
On a structural level it's exactly 1-1: HumanifyJS only does renames, no refactoring. It may come up with better names for variables than the original code though.
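If you want to sanity-check a "renames only" claim yourself without pulling in a parser, a crude token-level comparison gets you part of the way. To be clear, this is an approximation and not how HumanifyJS verifies anything: `tokenize` and `isRenameOnly` are made-up helpers, the tokenizer ignores comments, regex literals and template literals, and the check is not scope-aware, so it will reject a valid result where the same minified name gets different new names in different scopes.

```javascript
// Crude check that two snippets differ only by identifier names.
// This compares token streams, NOT real ASTs; it is an approximation.
const KEYWORDS = new Set([
  "break", "case", "catch", "class", "const", "continue", "default", "delete",
  "do", "else", "finally", "for", "function", "if", "in", "instanceof", "let",
  "new", "return", "switch", "this", "throw", "try", "typeof", "var", "void",
  "while", "true", "false", "null",
]);

// Split code into identifiers/keywords, numbers, quoted strings, and
// single punctuation characters. Whitespace is dropped.
function tokenize(code) {
  const pattern =
    /[A-Za-z_$][\w$]*|\d+(?:\.\d+)?|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|[^\s\w]/g;
  return code.match(pattern) || [];
}

function isRenameOnly(before, after) {
  const a = tokenize(before);
  const b = tokenize(after);
  if (a.length !== b.length) return false;
  const forward = new Map(); // old name -> new name
  const backward = new Map(); // new name -> old name
  const isName = (t) => /^[A-Za-z_$]/.test(t) && !KEYWORDS.has(t);
  for (let i = 0; i < a.length; i++) {
    if (isName(a[i]) && isName(b[i])) {
      // Each old name must map to exactly one new name, and vice versa.
      if ((forward.get(a[i]) ?? b[i]) !== b[i]) return false;
      if ((backward.get(b[i]) ?? a[i]) !== a[i]) return false;
      forward.set(a[i], b[i]);
      backward.set(b[i], a[i]);
    } else if (a[i] !== b[i]) {
      return false; // keyword, literal, or punctuation changed
    }
  }
  return true;
}
```

For anything serious you would parse both versions with a real JavaScript parser and compare the trees with identifier names ignored per scope.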
Came here to say Humanify is awesome both as a specific tool and in my opinion a really great way to think about how to get the most from inherently high-temperature activities like modern decoder nucleus sampling.
https://github.com/jehna/humanify