Ask HN: Is it significant that token length of source is close to e?
1 point by keepamovin 11 days ago | 5 comments
What's the relationship between the Shannon entropy of a distribution and token length? For example, English is often quoted as having a token length of ~4 characters, but source code (that I've tested) seems to be closer to 2.7. Is it significant that this is close to e (i.e., the base of the natural logarithm)? Is source code a more efficient and natural representation of structure/knowledge/information than English? Any thoughts? Is there any connection with how log appears in thermodynamic entropy?
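
For anyone who wants to put a number on the entropy side of the question, here is a minimal sketch that estimates the Shannon entropy of a text's character distribution. The function name `char_entropy_bits` is made up for illustration. It only looks at single-character frequencies and ignores context, so it says nothing about how any particular tokenizer splits the text. The choice of logarithm base only changes the unit (bits for base 2, nats for base e).

```python
import math
from collections import Counter


def char_entropy_bits(text: str) -> float:
    """Empirical Shannon entropy of the character distribution, in bits per character.

    Assumption: characters are treated as independent draws (unigram model),
    so context between neighbouring characters is ignored.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Running it on an English sample and a source file of similar size gives two per-character figures that can be compared alongside the characters-per-token ratio.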




My guess is that you are not naming your variables correctly.

*Based on file size / token count (as reported by Claude).
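
A rough sketch of that calculation, for anyone who wants to reproduce the ratio without access to the same token counts. The tokenizer here is a hypothetical regex stand-in, not the one Claude uses, so the numbers will differ from the 2.7 figure. It also divides by character count rather than file size in bytes, which is the same thing only for ASCII text.

```python
import re

# Hypothetical stand-in tokenizer: a run of word characters, or one
# non-space symbol. This is NOT Claude's tokenizer; it only illustrates
# the "length / token count" ratio.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def chars_per_token(text: str) -> float:
    """Average characters per token: total text length divided by token count."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        raise ValueError("no tokens found")
    # Whitespace counts toward the length but is never a token on its own.
    return len(text) / len(tokens)
```

In this sketch, operators such as `+ - * /` each count as one token, and spaces count toward the character total, so heavily punctuated code pulls the average down.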

Does token include operators like + - * / ?

Does the average length include the space?


Er, sir, Idk. Ask Anthropic.

Not too many comments either.


