Are we sure that unrestricted free-form Markdown content is the best configuration format for this kind of thing? I know there is a YAML frontmatter component to this, but doesn't the free-form nature of the "body" part of these configuration files lead to an inevitably unverifiable process?
I would like my agents to be inherently evaluable, and free-text instructions do not lend themselves easily to systematic evaluation.
>doesn't the free-form nature of the "body" part of these configuration files lead to an inevitably unverifiable process?
The non-deterministic, statistical nature of LLMs means it's an "inevitably unverifiable process" to begin with, even if you pass it some type-checked, linted skills file or prompt format.
Besides, YAML or JSON or XML or free-form text, for the LLM it's just tokens.
At best, the more structured formats are easier to parse with external tools, but that's about it; it makes little difference to how the LLM consumes them.
The modern state of the art is inherently not verifiable. How you format the input is secondary to that fact. When you can't see the weights or know anything else about the system, any notion of verifiability is an illusion.
Sure. Verifiability is far-fetched. But say I want to produce a statistically significant evaluation result from this – essentially testing a piece of prose. How do I go about this, short of relying on a vague LLM-as-a-judge metric? What are the parameters?
You 100% need to test work done by AI. If it's code, it needs to pass extensive tests; if it's just a question answered, it needs to be the common conclusion of multiple independent agents. You can trust a single AI about as much as an HN or Reddit comment, but you can trust a committee of four like a real expert.
More generally, I think testing AI by using its web search, code execution and ensembling is the missing ingredient for increased usage. We need to define the opposite of AI work: what validates it. This is hard, but once done you can trust the system, and it becomes cheaper to change.
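To make the "committee of four" concrete, here's a minimal sketch of majority-vote ensembling. `ask_model` is a placeholder for whatever client you use, and the quorum threshold is an arbitrary choice:

```python
# Hedged sketch: ask several independent agents the same question and only
# trust an answer a majority agrees on. `ask_model` is a placeholder.
from collections import Counter

def ask_model(model: str, question: str) -> str:
    """Placeholder: call one model/agent and return its answer."""
    raise NotImplementedError

def committee_answer(question: str, models: list[str], quorum: int = 3) -> str | None:
    answers = [ask_model(m, question) for m in models]
    # Normalise lightly so trivially different phrasings still match.
    tally = Counter(a.strip().lower() for a in answers)
    answer, votes = tally.most_common(1)[0]
    # No consensus -> return None and escalate to a human or more agents.
    return answer if votes >= quorum else None
```

This only works for questions whose answers can be compared after normalisation; for open-ended prose you'd still need a verifier of some kind.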
How would you evaluate it if the agent were not a fuzzy logic machine?
The issue isn't the LLM; it's that verification is actually the hard part. In any case, it's typically called “evals”, and you can probably craft a test harness to evaluate these if you think about it hard enough.
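As a sketch of what such a harness might look like for a skill file (the `run_agent` call and the cases are hypothetical): pair each prompt with a deterministic check, run the stochastic agent many times per case, and compare pass rates between skill variants.

```python
# Hedged sketch of an eval harness: deterministic checks, repeated trials.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # deterministic verifier, not a judge LLM

def run_agent(prompt: str, skill_text: str) -> str:
    """Placeholder: invoke the agent with the skill file under test."""
    raise NotImplementedError

def pass_rate(cases: list[EvalCase], skill_text: str, trials: int = 20) -> float:
    # The agent is stochastic, so one run per case tells you little;
    # repeated trials turn "does it work?" into a measurable rate.
    passes = sum(
        case.check(run_agent(case.prompt, skill_text))
        for case in cases
        for _ in range(trials)
    )
    return passes / (len(cases) * trials)
```

Comparing `pass_rate(cases, skill_a)` against `pass_rate(cases, skill_b)` with enough trials gives you the statistically significant result asked about above, via a simple two-proportion test.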
> if the input contents were parameterized and normalized to some agreed-upon structure
Just the format would be. There's no rigid structure that gets any preferential treatment by the LLM, even if it did accept one. In the end it's just instructions, no different in any way from the prompt text.
And nothing stops you from defining something "parameterized and normalized to some agreed-upon structure" yourself and passing it directly to the LLM as the skill content, or parsing it and dumping it into the skill as regular text.
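For example, here's a toy version of that second option, with an invented schema; the structured form can be linted and validated with ordinary tools before being rendered into the free-form body the LLM actually sees:

```python
# Sketch with a made-up schema: validate the structured form, then
# render it to plain skill text. To the model it's tokens either way.
skill = {
    "name": "jj-vcs",
    "when_to_use": "Repository uses jj instead of git.",
    "rules": [
        "Never run destructive commands without confirmation.",
        "Prefer `jj split` over manual file juggling.",
    ],
}

def render_skill(s: dict) -> str:
    rules = "\n".join(f"- {r}" for r in s["rules"])
    return f"# {s['name']}\n\nUse when: {s['when_to_use']}\n\n## Rules\n{rules}\n"

print(render_skill(skill))
```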
The DSPy + GEPA idea for this mentioned above[1] seems like it could be a reasonable approach for systematic evaluation of skills (not agents as a whole though). I'm going to give this a bit of a play over the holiday break to sort out a really good jj-vcs skill.
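For anyone curious, the rough shape of that approach is below. Treat the API details as assumptions and check the DSPy docs; the signature, metric, and examples here are invented for illustration.

```python
# Hedged sketch of optimizing a skill's prompt text with DSPy + GEPA.
# API details are from memory and may differ across DSPy versions.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported model id

# Stand-in program for the skill: task description in, jj commands out.
program = dspy.ChainOfThought("task -> jj_commands")

def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Deterministic check; GEPA can also use textual feedback here.
    return float(gold.jj_commands.strip() in pred.jj_commands)

trainset = [
    dspy.Example(task="undo the last change",
                 jj_commands="jj undo").with_inputs("task"),
    # ...more labelled examples...
]

optimizer = dspy.GEPA(metric=metric, auto="light",
                      reflection_lm=dspy.LM("openai/gpt-4o"))
# valset should really be a held-out split; reused here only for brevity.
optimized = optimizer.compile(program, trainset=trainset, valset=trainset)
```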
I love this.
I tried doing sub-pixel simulation for a tool I created (screenstab.com if anyone’s interested – yeah I know, shameless plug, etc.). I ended up abandoning the sub-pixel aspect of my shader because of the distracting moiré patterns it caused.
Neat. Would be nice if there were some examples of what a beads rendition looks like. Maybe it's obvious for people in the game. I assume they are hexagonal?
While Figma can be a useful tool for aligning design with code, I think it's unrealistic to expect it to accommodate all the constraints of the web platform. Relying solely on a 1-to-1 mapping between Figma components and code components can be problematic and may not accurately reflect the complexities and nuances of web development.
> The designers should be working with the developers to implement their vision
I agree with the importance of this. I guess my gripe is with the fact that at the end of the day, the burden of formalizing anything that gets put on the web is on the shoulders of developers, even in the case of expressing design language, as this usually isn’t discernibly structured until the developer starts typing out code.
> I guess my gripe is with the fact that at the end of the day, the burden of formalizing anything that gets put on the web is on the shoulders of developers
Welcome to the world of system development. This has always been the case unless your customer is operating at a similar technical level and can formalize the requirements in your own language (or near enough). Your designers are able to formalize their requirements, but using a domain of discourse that your developers are unfamiliar with, and probably missing details your developers need because they, the designers, are unfamiliar with the domain of discourse your developers use. This always happens, no matter the field. Each group has their own domain language with its own notion and degree of formalization. It falls on the developers to ensure their understanding is correct. The same could be said for non-software development efforts: an architect has the same problem with their customers, and a builder has the same problem with the architect.
It is certainly frustrating, but that frustration has to be overcome. Unless your customers (the designers, in your case) are intransigent and refuse to communicate when asked for clarification, refinement of details, or feedback on a partial implementation, this is a surmountable problem.
Formalization does typically occur during development, but it's usually the developer who initiates and enforces it. It would be beneficial for designers to see formalization as an integral part of the design process itself, which could lead to more efficient collaboration between designers and developers and streamline the entire development process.
OP here. I'm proposing a tool similar to Figma, but focused on designing screen reader experiences. The goal is to encourage designers to formalize and structure their work in a way that accounts for accessibility and the experience of users who rely on screen readers.
This new tool would require designers to think about the semantics and hierarchy of the content, forcing them to consider not just the visual presentation, but also the underlying structure that screen readers rely on. By doing so, designers would have to make their design intentions more explicit and less open to interpretation by front-end developers.
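As a purely illustrative sketch (the schema is hypothetical, not from any real tool), the artifact such a tool produces might look less like a canvas and more like an accessibility tree:

```python
# Invented schema: roles, accessible names, and reading order instead of pixels.
from dataclasses import dataclass, field

@dataclass
class A11yNode:
    role: str                 # e.g. "navigation", "heading", "button"
    name: str                 # what the screen reader announces
    level: int | None = None  # heading level, if applicable
    children: list["A11yNode"] = field(default_factory=list)

checkout = A11yNode("main", "Checkout", children=[
    A11yNode("heading", "Your order", level=1),
    A11yNode("list", "3 items in your cart"),
    A11yNode("button", "Pay now"),
])
```

A designer committing to a tree like this commits to semantics a developer can implement and test directly, rather than leaving them to be inferred from a visual mockup.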
I shared this idea with my coworkers and the reception was lukewarm.
It got a lukewarm reception because it doesn’t take into account how design works and how it adds value. Designers are not engineers. Forcing them to formalize early in the process makes them less efficient and hinders their ability to explore widely.
Figma has a lot of features, like symbols, auto layout, and design system support, that can be used to introduce formality and structure.
Thanks for your comment! I'm glad it caught your attention and stood out as unique. Developing this has taken an inordinate amount of effort, so it's rewarding to see it recognised. I set out to create something different, and you made me feel like I accomplished that. Thanks for your support!
OP here. This really blew up. I actually made this back in 2021 (doesn't seem long ago), and probably tried posting it to HN back then, to no avail. I just posted it again on a whim, because I felt I was on a roll with my previous post on here about the metal skeuomorphism thingamajig (https://www.metalmorphism.com). If you liked this project, and want to see what I've been up to in my spare time lately, feel free to check out that discussion: https://news.ycombinator.com/item?id=34707160