Author of HumanifyJS here! I've created an LLM-based tool specifically for this, ...

thomassmith65 · on Aug 29, 2024

Would it be difficult to add a 'rename from scratch' feature? I mean a feature that takes normal code (as opposed to minified code) and (1) scrubs all the user's meaningful names, (2) chooses names based on the algorithm and remaining names (ie: the built-in names).

Sometimes when I refactor, I do this manually with an LLM. It is useful in at least two ways: it can reveal better (more canonical) terminology for names (eg: 'antiparallel_line' instead of 'parallel_line_opposite_direction'), and it can also reveal names that could be generalized (eg: 'find_instance_in_list' instead of 'find_animal_instance_in_animals').

jehna1 · on Aug 29, 2024

Yes, I think you could use HumanifyJS for that. The way it works is that:

1. I ask the LLM to describe the meaning of the variable in the surrounding code

2. Given just the description, I ask the LLM to come up with the best possible variable name

You can check the source code for the actual prompts:

https://github.com/jehna/humanify/blob/eeff3f8b4f76d40adb116...
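A minimal sketch of the two-step flow described above, for readers who want to see its shape. The `askLLM` callback, the prompt wording, and the `fakeLLM` stub are all hypothetical and are not taken from the HumanifyJS source; a real model call would also be asynchronous. The point it illustrates is that the second prompt sees only the description, so the old name cannot leak into the new one.

```javascript
// Hypothetical two-step naming flow. `askLLM` is an assumed callback
// (prompt string in, completion string out), not a HumanifyJS API.
function suggestName(askLLM, variableName, surroundingCode) {
  // Step 1: ask for a plain-language description of the variable's role.
  const description = askLLM(
    `Describe what the variable "${variableName}" does in this code:\n${surroundingCode}`
  );
  // Step 2: ask for a name using only the description, not the old name.
  const rawName = askLLM(
    `Suggest one camelCase variable name for: ${description}`
  );
  return rawName.trim();
}

// Stub standing in for a real model so the sketch runs offline.
function fakeLLM(prompt) {
  return prompt.startsWith("Describe")
    ? "the number of items processed per batch"
    : " chunkSize\n";
}

const suggested = suggestName(fakeLLM, "e", "for (var t = 0; t < n; t += e) {}");
console.log(suggested);
```

Because step 2 never sees the original identifier, the same flow would apply to the "rename from scratch" idea: existing meaningful names simply play the role of the minified ones.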

firtoz · on Aug 29, 2024

More tools should be built on ASTs, great work!

I'm still waiting for the AST level version control tbh

jansvoboda11 · on Aug 29, 2024

Unison supposedly has an AST-aware version control system: https://www.unison-lang.org/

LoganDark · on Aug 29, 2024

content-addressed too, I think!

timcobb · on Aug 29, 2024

Wow this looks so cool.

rightonbrother · on Aug 29, 2024

Smalltalk envy source control

sebstefan · on Aug 29, 2024

What kind of question does it ask the LLM? Giving it a whole function and asking "What should we rename <variable 1>?" repeatedly until everything has been renamed?

Asking it to do it on the whole thing, then parsing the output and checking that the AST still matches?

jehna1 · on Aug 29, 2024

For each variable:

1. It asks the LLM to write a description of what the variable does

2. It asks for a good variable name based on the description from 1.

3. It uses a custom Babel plugin to do a scope-aware rename

This way the LLM only decides the name, but the actual renaming is done with traditional and reliable tools.
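To show why the rename in step 3 has to be scope-aware, here is a toy sketch. The scope objects (`declares`, `identifiers`, `children`) are a hand-built assumption for illustration and are not how Babel or the HumanifyJS plugin represents scopes; in real code the parser and scope analysis would produce this information.

```javascript
// Toy model: each scope lists the names it declares, the identifier
// nodes that appear directly in it, and its nested child scopes.
function renameInScope(scope, oldName, newName) {
  for (const id of scope.identifiers) {
    if (id.name === oldName) id.name = newName;
  }
  scope.declares = scope.declares.map((n) => (n === oldName ? newName : n));
  for (const child of scope.children) {
    // A child that redeclares the name shadows it, so leave it alone.
    if (!child.declares.includes(oldName)) {
      renameInScope(child, oldName, newName);
    }
  }
}

// Models: function outer(a) { a; function inner(a) { a; } function other() { a; } }
const inner = { declares: ["a"], identifiers: [{ name: "a" }, { name: "a" }], children: [] };
const other = { declares: [], identifiers: [{ name: "a" }], children: [] };
const outer = { declares: ["a"], identifiers: [{ name: "a" }, { name: "a" }], children: [inner, other] };

renameInScope(outer, "a", "inputText");
console.log(outer.identifiers.map((id) => id.name)); // renamed
console.log(inner.identifiers.map((id) => id.name)); // shadowed, untouched
console.log(other.identifiers.map((id) => id.name)); // closes over outer's binding, renamed
```

A plain text search-and-replace would have changed `inner`'s parameter too, which is the kind of mistake that doing the rename on the AST avoids.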

dotancohen · on Aug 30, 2024

This answer is reassuring.

Based on it, I went and read the readme. The readme was also excellent, and answered every question I had. Great job, thank you, I'll be trying this.

thrdbndndn · on Aug 29, 2024

Does it work with huge files? I'm talking about something like 50k lines.

Edit: I'm currently trying it with a mere 1.2k-line JS file (openai mode), and it's only 70% done after 20 minutes. Even if it works theoretically with a 50k LOC file, I don't think you should try.

jehna1 · on Aug 29, 2024

It does work with files of any size, although it is quite slow if you're using the OpenAI API. HumanifyJS processes each variable name separately, which keeps the context size manageable for an LLM.

I'm currently working on parallelizing the rename process, which should give orders of magnitude faster processing times for large files.

kingsloi · on Aug 29, 2024

It has this in the README

> Large files may take some time to process and use a lot of tokens if you use ChatGPT. For a rough estimate, the tool takes about 2 tokens per character to process a file:

> echo "$((2 * $(wc -c < yourscript.min.js)))"

> So for reference: a minified bootstrap.min.js would take about $0.5 to un-minify using ChatGPT.

> Using humanify local is of course free, but may take more time, be less accurate and not possible with your existing hardware.
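The same rule of thumb can be written as a small JavaScript helper. This is only a sketch of the README's "2 tokens per character" estimate; the function name `estimateTokens` is made up here, and it counts bytes to mirror what `wc -c` reports.

```javascript
// Rough token estimate using the README's rule of thumb
// (about 2 tokens per character); byte length mirrors `wc -c`.
function estimateTokens(source) {
  return 2 * Buffer.byteLength(source, "utf8");
}

const sample = "function a(e,t){return e+t}";
console.log(estimateTokens(sample));
```

For a file on disk, the string passed in would be the file's contents. Converting the token count to a price depends on the model's current pricing, which is not covered here.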

thrdbndndn · on Aug 29, 2024

This only talks about the cost.

I'm more concerned about whether it can actually deobfuscate such a large file (context) and generate useful results.

punkpeye · on Aug 29, 2024

Looks useful! I will update the article to link to this tool. Thanks for sharing!

jehna1 · on Aug 29, 2024

Super, thank you for adding the link! It really helps people find the tool.

cryptoz · on Aug 29, 2024

Finally someone else using ASTs while working with LLMs and modifying code! This is such an under-utilized area. I am also doing this with good results: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...

jehna1 · on Aug 29, 2024

Super interesting! Since you're generating code with LLMs, you should check out this paper:

https://arxiv.org/pdf/2405.15793

It uses smart feedback to fix the code when the LLM occasionally hiccups. You could also have a "supervisor LLM" that asserts that the resulting code matches the specification, and gives feedback if it doesn't.

zamadatix · on Aug 29, 2024

It's a shame this loses one of the most useful aspects of LLM un-minifying - making sure it's actually how a person would write it. E.g. GPT-4o directly gives the exact same code (+contextual comments) with the exception of writing the for loop in the example in a natural way:

    for (var index = 0; index < inputLength; index += chunkSize) {

Comparing the ASTs is useful though. Perhaps there's a way to combine the approaches - have the LLM convert, compare the ASTs, have the LLM explain the practical differences (if any) in context of the actual implementation and give it a chance to make any changes "more correct". Still not guaranteed to be perfect but significantly more "natural" resulting code.
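One way the "compare the ASTs" step could look is a structural comparison that ignores identifier names, so that pure renames pass and any other change is flagged for review. The sketch below uses hand-built ESTree-like objects and the made-up helper names `comparableKeys` and `sameShape`; real parser output also carries position fields (such as `start`, `end`, `loc`) that would need to be stripped first.

```javascript
// List the keys to compare, skipping `name` on Identifier nodes so that
// a pure rename does not count as a difference.
function comparableKeys(node) {
  return Object.keys(node)
    .filter((key) => !(node.type === "Identifier" && key === "name"))
    .sort();
}

// Recursively compare two ESTree-like values. This checks shape only;
// it does not verify that a rename was applied consistently.
function sameShape(a, b) {
  if (Array.isArray(a) || Array.isArray(b)) {
    return Array.isArray(a) && Array.isArray(b) && a.length === b.length &&
      a.every((item, i) => sameShape(item, b[i]));
  }
  const isObject = (v) => v !== null && typeof v === "object";
  if (!isObject(a) || !isObject(b)) return a === b;
  const keysA = comparableKeys(a);
  const keysB = comparableKeys(b);
  return keysA.length === keysB.length &&
    keysA.every((key, i) => key === keysB[i] && sameShape(a[key], b[key]));
}

const id = (name) => ({ type: "Identifier", name });
const minified = { type: "BinaryExpression", operator: "<", left: id("t"), right: id("n") };
const renamed = { type: "BinaryExpression", operator: "<", left: id("index"), right: id("inputLength") };
const rewritten = { type: "BinaryExpression", operator: "<=", left: id("index"), right: id("inputLength") };

console.log(sameShape(minified, renamed));   // only names differ
console.log(sameShape(minified, rewritten)); // operator differs
```

When the shapes differ, the differing subtree is what would be handed back to the LLM for an explanation, as suggested above.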

ouraf · on Aug 30, 2024

Depends on how many tokens you want to spend.

Making the code, fully commenting it and also giving an example after that might cost three times as much

strictnein · on Aug 30, 2024

As someone who has spent countless hours and days deobfuscating malicious JavaScript by hand (manually and with some scripts I wrote), your tool is really, really impressive. Running it locally on a high-end system with an RTX 4090 and it's great. Good work :)

boltzmann-brain · on Aug 29, 2024

how do you make an LLM work on the AST level? do you just feed a normal LLM a text representation of the AST, or do you make an LLM where the basic data structure is an AST node rather than a character string (human-language word)?

WhitneyLand · on Aug 29, 2024

The frontier models can all work with both source code and ASTs as a result of their standard training.

Knowing this raises the question: which is better to feed an LLM, source code or ASTs?

The answer is that it really depends on the use case; there are tradeoffs. For example, keeping comments intact possibly gives the model hints to reason better. On the other hand, it can be argued that a pure AST has less noise for the model to be confused by.

There are other tradeoffs as well. For example, any analysis relating to coding styles would require the full source code.

dunham · on Aug 29, 2024

It looks like they're running `webcrack` to deobfuscate/unminify and then asking the LLM for better variable names.

jehna1 · on Aug 29, 2024

I'm using both a custom Babel plugin and LLMs to achieve this.

Babel first parses the code to AST, and for each variable the tool:

1. Gets the variable name and surrounding scope as code

2. Asks the LLM to come up with a good name for the given variable name, by looking at the scope where the variable is

3. Uses Babel to make the context-aware rename to AST based on the LLM's response

bgirard · on Aug 29, 2024

How does the output of minify + humanify compare to the original un-minified code? It would be neat if it could improve mediocre code.

jehna1 · on Aug 29, 2024

On a structural level it's exactly 1-1: HumanifyJS only does renames, no refactoring. It may come up with better names for variables than the original code, though.

Thorrez · on Aug 30, 2024

Can it guarantee 1-1? Doesn't Javascript allow looking up fields using a string name? That string could be computed in a complex manner.

j4k0xb · on Aug 30, 2024

It does in fact change the structure, but only safe-ish AST transformations related to minifiers (e.g. `void 0` to `undefined`):

- https://github.com/jehna/humanify/blob/eeff3f8b4f76d40adb116...

- https://webcrack.netlify.app/docs/concepts/unminify.html

properties and strings aren't renamed
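A small illustration of why leaving properties and strings alone matters for the computed-lookup concern raised above. The object and names here are invented for the example.

```javascript
// A property name built at runtime: no static rename can follow it.
const config = { timeoutMs: 500 };
const key = ["timeout", "Ms"].join("");
console.log(config[key]);

// Renaming the local binding `config` is safe because every reference
// to it is visible in the AST. Renaming the property `timeoutMs` is not:
const renamedConfig = { requestTimeout: 500 };
console.log(renamedConfig[key]);
```

The first lookup finds the value; the second yields `undefined` because the computed string still spells the old property name.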

fny · on Aug 29, 2024

Is it possible to add a mode that doesn't depend on API access (e.g. copy and paste this prompt to get your answer)? Or do you make roundtrips?

jehna1 · on Aug 29, 2024

There is a fully local mode that does not use ChatGPT at all – everything happens on your local machine.

API access is needed for ChatGPT mode, as there are many round trips and it uses advanced API-only tricks to force the LLM output.

KolmogorovComp · on Aug 29, 2024

Thanks for your tool. Have you been able to quantify the gap between your local model and chatgpt in terms of ‘unminification performance’?

jehna1 · on Aug 29, 2024

At the moment I haven't found good ways of measuring the quality between different models. Please share if you have any ideas!

For small scripts I've found the output to be very similar between small local models and GPT-4o (judging by a human eye).

anticensor · on Aug 29, 2024

Thanks for creating this megafier, can you add support for local LLMs?

jehna1 · on Aug 29, 2024

Better yet, it already does have support for local LLMs! You can use them via `humanify local`

benreesman · on Aug 29, 2024

Came here to say Humanify is awesome both as a specific tool and in my opinion a really great way to think about how to get the most from inherently high-temperature activities like modern decoder nucleus sampling.

+1