chat completions is stateless — you must provide the entire conversation history with each new message; openai stores nothing (at least nothing that the downstream product _can use_) beyond the life of the request.

responses api, by contrast, is stateful — only send the latest message, and openai stores the conversation history, while keeping track of other details on behalf of the calling app, like parallel tool call states.

but i would say that since chat completions has become an informal industry standard (swapping providers takes nothing more than a base url and a model id), the responses api feels like an attempt by openai to break away from that shared interface, toward a paradigm that requires data migration as well as replacement infrastructure (containers for code execution, for example).
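to make the contrast concrete, here's a rough sketch of the two request shapes as plain dicts (field names follow openai's public docs; the response id is a placeholder, and nothing here makes a network call):

```python
# chat completions: stateless. the client resends the full history every turn.
chat_request = {
    "model": "gpt-4o",
    "messages": [                      # entire conversation, every single call
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "hello! how can i help?"},
        {"role": "user", "content": "tell me a joke"},
    ],
}

# responses: stateful. reference the prior response id, send only the new input.
responses_request = {
    "model": "gpt-4o",
    "previous_response_id": "resp_abc123",  # placeholder id from the prior call
    "input": "tell me a joke",              # just the latest message
    # "store": False,                       # opts out of server-side state
}

assert len(chat_request["messages"]) == 3
assert "messages" not in responses_request
```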


one additional difference between chat and responses is the number of model turns a single api call can make. chat completions is a single-turn api primitive -- it can talk to the model just once per call. responses is capable of making multiple model turns and tool calls in a single api call.

for example, you can give the responses api access to 3 tools: a vector store with some user memories (file_search), the shopify mcp server, and code_interpreter. you can then ask it to look up some user memories, find relevant items in the shopify mcp store, and then download them into a csv file. all of this can be done in a single api call that involves multiple model turns and tool calls.
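a rough sketch of what such a request body could look like, as plain data (the tool type names follow openai's published responses api docs, but the vector store id and mcp server url here are made-up placeholders):

```python
# hypothetical responses api request wiring up the three tools described above.
request = {
    "model": "gpt-4o",
    "input": "look up my saved preferences, find matching items, export to csv",
    "tools": [
        # user memories via a vector store (placeholder id)
        {"type": "file_search", "vector_store_ids": ["vs_user_memories"]},
        # remote mcp server (placeholder url standing in for shopify's)
        {"type": "mcp", "server_label": "shopify",
         "server_url": "https://example.com/mcp"},
        # sandboxed python container for building the csv
        {"type": "code_interpreter", "container": {"type": "auto"}},
    ],
}

# a single POST to /v1/responses with this body can involve several model
# turns and tool calls server-side before the final response comes back.
tool_types = [t["type"] for t in request["tools"]]
assert tool_types == ["file_search", "mcp", "code_interpreter"]
```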

p.s. - you can also use responses statelessly by setting store=false.


What are my choices for using a custom tool? Does it come down to function calling (single turn) or MCP (multi-turn via Responses)? Are there other options?

Why would anyone want to use Responses statelessly? Just trying to understand.


i think the original intent of responses api was also to unify the realtime experiences into responses - is that accurate?


we expect responses and realtime to be our 2 core api primitives long term — responses for turn-by-turn interactions, and realtime for use cases requiring low-latency bidirectional streams between apps and models.


thank you for the correction!


This is very enlightening, thank you. You're right then; it does seem to be, at least in part, a strategic moat-building move by OpenAI.


simple as two config options in `.zed/settings.json` (or `~/.config/zed/settings.json`):

```
{
  "telemetry": {
    "diagnostics": false,
    "metrics": false
  }
}
```

for anyone who wants a fast, modern, resource-minimal, gpu-native text editor that is open source.


This should be opt-in or at least have a giant red confirmation dialog shown until the user agrees.


Sorry for the confusing experience, and thank you for sharing this!

I’ve just submitted a new update for review with a few small but hopefully noticeable changes, thanks in no small part to your feedback:

1. StableLM Zephyr 3b Q4_K_M is now the built-in model, replacing the Q6_K variant.

2. More aggressive RAM headroom calculation, with forced fallback to CPU rather than failing to load as you observed, or crashing outright in some nasty edge cases.

3. New status indicator for Metal when model is loaded (filled bolt for enabled, vs slashed bolt for disabled.)

4. Metal will now also be enabled for devices with 4GB RAM or less, but only when the selected model can comfortably fit in RAM. Previously, only devices with at least 6GB had Metal enabled.
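a minimal sketch of the kind of headroom check item 2 describes; the function name and threshold here are illustrative guesses, not the app's actual code:

```python
GIB = 1024 ** 3

def should_use_metal(model_bytes: int, total_ram_bytes: int,
                     headroom_fraction: float = 0.35) -> bool:
    """Enable GPU (Metal) only if the model fits with comfortable headroom;
    otherwise fall back to CPU instead of risking an out-of-memory crash.
    The headroom fraction is illustrative, not the app's real value."""
    budget = total_ram_bytes * (1 - headroom_fraction)
    return model_bytes <= budget

# a ~2 GiB 3b Q4_K_M model on a 4 GiB device: fits within budget, Metal ok
assert should_use_metal(2 * GIB, 4 * GIB)
# a ~5.5 GiB 7b model on a 6 GiB device: too tight, fall back to CPU
assert not should_use_metal(int(5.5 * GIB), 6 * GIB)
```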

Thank you so much for taking the time to test and share your experience! Feel free to reach out anytime at britt [at] bl3 [dot] dev.

Britt


TL;DR: No. Nearly all these apps use the GPU (via Metal) or the CPU, not the Neural Engine (ANE).

Why? I suggest a few main reasons:

1) No Neural Engine API

2) CoreML has challenges modeling LLMs efficiently right now.

3) Not Enough Benefit (For the Cost... Yet!)

This is my best understanding based on my own work and research for a local LLM iOS app. Read on for more in-depth justifications of each point!

---

1) No Neural Engine API

- There is no developer API to use the Neural Engine programmatically; CoreML is the only way to use it.

2) CoreML has challenges modeling LLMs efficiently right now.

- Its most-optimized use cases seem tailored for image models, as it works best with fixed input lengths[1][2], which are fairly limiting for general language modeling (are all prompts, sentences and paragraphs, the same number of tokens? do you want to pad all your inputs?).

- CoreML features limited support for the leading approaches for compressing LLMs (quantization, whether weights-only or activation-aware). Falcon-7b-instruct (fp32) in CoreML is 27.7GB [3], Llama-2-chat (fp16) is 13.5GB [4] — neither will fit in memory on any currently shipping iPhone. They'd only barely fit on the newest, highest-end iPad Pros.

- HuggingFace's swift-transformers[5] is a CoreML-focused library under active development to eventually help developers with many of these problems, in addition to an `exporters` cli tool[6] that wraps Apple's `coremltools` for converting PyTorch or other models to CoreML.
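To make those sizes concrete, the arithmetic is roughly parameter count × bytes per parameter (a back-of-envelope sketch; the effective bits-per-weight of quant schemes like Q4_K_M vary slightly by scheme):

```python
def model_size_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in decimal GB: params * bits / 8 bits-per-byte."""
    return num_params * bits_per_param / 8 / 1e9

# llama-2-7b (~6.74e9 params) at fp16: ~13.5 GB, matching the figure in [4]
assert round(model_size_gb(6.74e9, 16), 1) == 13.5
# the same model at ~4.5 effective bits/weight (roughly a Q4_K_M) is ~3.8 GB,
# which is why quantization is what makes 7b models feasible on phones at all
assert 3 < model_size_gb(6.74e9, 4.5) < 4
```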

3) Not Enough Benefit (For the Cost... Yet!)

- ANE & GPU (Metal) have access to the same unified memory. They are both subject to the same restrictions on background execution (you simply can't use them in the background, or your app is killed[7]).

- So the main benefit from unlocking the ANE would be multitasking: running an ML task in parallel with non-ML tasks that might also require the GPU: e.g. SwiftUI Metal Shaders, background audio processing (shoutout Overcast!), screen recording/sharing, etc. Absolutely worthwhile to achieve, but for the significant work required and the lack of ecosystem currently around CoreML for LLMs specifically, the benefits become less clear.

- Apple's hot new ML library, MLX, only uses Metal for GPU[8], just like Llama.cpp. More nuanced differences arise on closer inspection related to MLX's focus on unified memory optimizations. So perhaps we can squeeze out some performance from unified memory in Llama.cpp, but CoreML will be the only way to unlock ANE, which is lower priority according to lead maintainer Georgi Gerganov as of late this past summer[9], likely for many of the reasons enumerated above.

I've learned most of this while working on my own private LLM inference app, cnvrs[10] — would love to hear your feedback or thoughts!

Britt

---

[1] https://github.com/huggingface/exporters/pull/37

[2] https://apple.github.io/coremltools/docs-guides/source/flexi...

[3] https://huggingface.co/tiiuae/falcon-7b-instruct/tree/main/c...

[4] https://huggingface.co/coreml-projects/Llama-2-7b-chat-corem...

[5] https://github.com/huggingface/swift-transformers

[6] https://github.com/huggingface/exporters

[7] https://developer.apple.com/documentation/metal/gpu_devices_...

[8] https://github.com/ml-explore/mlx/issues/18

[9] https://github.com/ggerganov/llama.cpp/issues/1714#issuecomm...

[10] https://testflight.apple.com/join/ERFxInZg


This is really interesting, thank you.

What would be the downside to padding all inputs to have consistent input token size?


Conceptually, to the best of my understanding, nothing too serious; perhaps the inefficiency of processing a larger input than necessary?

Practically, a few things:

If you want to have your cake & eat it too, they recommend Enumerated Shapes[1] in their coremltools docs, where CoreML precompiles up to 128 (!) variants of input shapes, but again this is fairly limiting (1-token, 2-token, 3-token... up to 128-token prompts; maybe you enforce a minimum, say 80 tokens to account for a system prompt, so up to 200 tokens, but still pretty short). And this is only compatible with CPU inference, which reduces its appeal.

It seems like CoreML's current state was designed for text embedding models, where you normalize input length by chunking (often 128 or 256 tokens) and operate on the chunks — and indeed, that's the only text-based CoreML model Apple ships today: a BERT embedding model tuned for Q&A[2], not an LLM.

You could use a fixed input length that's fairly large; I haven't experimented with it since grasping the memory requirements, but from what I gather from HuggingFace's announcement blog post[3], that is what they do with swift-transformers & their CoreML conversions, handling the details for you[4][5]. I haven't carefully investigated the implementation, but I'm curious to learn more!
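For a feel of the padding cost, here's a toy sketch (the token ids are made up, and this is just the general idea, not swift-transformers' actual implementation):

```python
def pad_to_fixed(tokens: list[int], seq_len: int, pad_id: int = 0) -> list[int]:
    """Right-pad a prompt to a fixed sequence length, as a fixed-shape
    CoreML input would require. Raises if the prompt is already too long."""
    if len(tokens) > seq_len:
        raise ValueError("prompt exceeds the model's fixed input length")
    return tokens + [pad_id] * (seq_len - len(tokens))

prompt = [101, 2023, 2003, 1037, 3231, 102]  # 6 real tokens (made-up ids)
padded = pad_to_fixed(prompt, seq_len=128)
assert len(padded) == 128

# the downside in one number: fraction of the forward pass spent on padding
waste = 1 - len(prompt) / len(padded)
assert waste > 0.95  # over 95% of this fixed-shape input is pad tokens
```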

You can be sure that no one is more aware of all this than Apple — they published "Deploying Transformers on the Apple Neural Engine" in June 2022[6]. I look forward to seeing what they cook up for developers at WWDC this year!

---

[1] "Use `EnumeratedShapes` for best performance. During compilation the model can be optimized on the device for the finite set of input shapes. You can provide up to 128 different shapes." https://apple.github.io/coremltools/docs-guides/source/flexi...

[2] BertSQUAD.mlmodel (fp16) https://developer.apple.com/machine-learning/models/#text

[3] https://huggingface.co/blog/swift-coreml-llm#optimization

[4] `use_fixed_shapes` "Retrieve the max sequence length from the model configuration, or use a hardcoded value (currently 128). This can be subclassed to support custom lengths." https://github.com/huggingface/exporters/pull/37/files#diff-...

[5] `use_flexible_shapes` "When True, inputs are allowed to use sequence lengths of `1` up to `maxSequenceLength`. Unfortunately, this currently prevents the model from running on GPU or the Neural Engine. We default to `False`, but this can be overridden in custom configurations." https://github.com/huggingface/exporters/pull/37/files#diff-...

[6] https://machinelearning.apple.com/research/neural-engine-tra...


great high effort answer, thanks so much!

to prod you to sell yourself a bit more - what is the goal/selling point of cnvrs?


Oh man I’m a big fan, swyx!! Latent Space & AI.engineer are fantastic resources to the community. Thank you for the kind words & the prompt!

It’s still early days, but at a high level, I have a few goals:

- expand accessibility and increase awareness of the power & viability of small models — the scene can be quite impenetrable for many!

- provide an easy-to-use, attractive, efficient app that’s a good platform citizen, taking full advantage of Apple’s powerful device capabilities;

- empower more people to protect their private conversation data, which has material value to large AI companies;

- incentivize more experimentation, training & fine-tuning efforts focused on small, privately-runnable models.

I’d love to one day become your habitual ChatGPT alternative, as high a bar as that may be.

I have some exciting ideas, from enabling a user-generated public gallery of characters; to expanding into multimodal use cases, like images & speech; composing larger workflows on top of LLMs, similar to Shortcuts; grounding open models against web search indices for factuality; and further out, more speculative ideas, including exposing JavaScriptCore to models as a tool, the way ChatGPT’s code interpreter exposes Python.

But I’m sure you’ve also given a lot of thought to the future of AI on device with smol — what are some dreams you have for truly private AI that’s always with you?


i dont dream of truly private ai like that haha. im a pretty open book. but very very glad to see more options in the local ai space!


you can absolutely access and continue all your past chats in cnvrs!

would love to hear what you think: https://testflight.apple.com/join/ERFxInZg


EDIT: Attempting to converse with any Q4_K_M 7B parameter model on a 15 Pro Max... the phone just melts down. It feels like it is producing about one token per minute. MLC-Chat can handle 7B parameter models just fine even on a 14 Pro Max, which has less RAM, so I think there is an issue here.

EDIT 2: Even using StableLM, I am experiencing a total crash of the app fairly consistently if I chat in one conversation, then start a new conversation and try to chat in that. On a related note, since chat history is saved... I don't think it's necessary to have a confirmation prompt if the user clicks the "new chat" shortcut in the top right of a chat.

-----

That does seem much nicer than MLC Chat. I really like the selection of models and saving of conversations.

It looks like you’re still using the old version of TinyLlama. The 1.0 release is out now: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGU...

Microsoft recently re-licensed Phi-2 to be MIT instead of non-commercial, so I would love to see that in the list of models. Similarly, there is a Dolphin-Phi fine tune.

The topic of discussion here is Mistral-7B v0.2, which is also missing from the model list, unfortunately. There are a few Mistral fine tunes in the list, but obviously not the same thing.

I also wish I could enable performance metrics to see how many tokens/sec the model was running at after each message, and to see how much RAM is being used.

On the whole, this app seems really nice!


Wow, thanks so much for taking the time to test it out and share such great feedback!

Thrilled about all those developments! More model options as well as link-based GGUF downloads on the way.

On the 7b models: I’m very sorry for the poor experience. I wouldn’t recommend any 7b quant above Q2_K at the moment, unless you’re on a 16GB iPad (or an Apple Silicon Mac!). This needs to be much clearer; as you observed, the consequences can be severe. The larger models, and even 3b Q6_K, can be crash-prone due to memory pressure. Will work on improving handling of low-level out-of-memory errors very soon.

Will also investigate the StableLM crashes; I’m sorry about that! Hopefully TestFlight recorded a trace. Just speculating, it may be a similar issue to the larger models, due to the higher-fidelity quant (Q6_K) combined with the context length eventually running out of RAM. Could you give the Q4_K_M a shot? I heard something similar from a friend yesterday, and I’m curious if you have a better time with that — perhaps that’s a more sensible default.

Re: the overly-protective new chat alert, I agree, thanks for the suggestion. I’ll incorporate that into the next build. Can I credit you? Let me know how you’d like for me to refer to you, and I’d be happy to.

Finally, please feel free to email me any further feedback, and thanks again for your time and consideration!

britt [at] bl3 [dot] dev


I just checked and MLC Chat is running the 3-bit quantized version of Mistral-7B. It works fine on the 14 Pro Max (6GB RAM) without crashing, and is able to stay resident in memory on the 15 Pro Max (8GB RAM) when switching with another not-too-heavy app. 2-bit quantization just feels like a step too far, but I’ll give it a try.

Regarding credit, I definitely don’t need any. Just happy to see someone working on a better LLM app!


FYI, just submitted a new update for review with a few small but hopefully noticeable changes, thanks in no small part to your feedback:

1. StableLM Zephyr 3b Q4_K_M is now the built-in model, replacing the Q6_K variant.

2. More aggressive RAM headroom calculation, with forced fallback to CPU rather than failing to load or crashing.

3. New status indicator for Metal when model is loaded (filled bolt for enabled, vs slashed bolt for disabled.)

4. Metal will now also be enabled for devices with 4GB RAM or less, but only when the selected model can comfortably fit in RAM. Previously, only devices with at least 6GB had Metal enabled.

Thank you so much again for your time!


The fallback does seem to work! Although the 4-bit 7B models only run at 1 token every several seconds.

I still wish Phi-2, Dolphin Phi-2, and TinyLlama-Chat-v1.0 were available, but I understand you have plans to make it easier to download any model in the future.


4-bit StableLM and 2-bit 7B models do seem to be working more consistently.


That’s great to hear. I’m sorry again about that poor experience, and please do reach out if you have any other feedback!

Britt


My free / mostly open source app also stores conversation history, synced via iCloud

https://ChatOnMac.com

edit: I can't reply to you below: Do you have the right app? There's no TestFlight, just an App Store link. If it's ChatOnMac, then it should have a dropdown at the top of the chat room to select a model. If it's empty or otherwise bugged out, please let me know what you see in the top menu. It filters the available model presets based on how much RAM you have available, so let me know what specific device you have and I can look into it. Thank you.

The model presets are also configurable by forking the bot and loading your own via GitHub (bots run inside sandboxed hidden webviews inside the app). But this is not ergonomically friendly just yet.


I was excited when I saw this, but I'm having trouble with it (and it looks like I'm not the only one). As others have pointed out, the download link on your site does open TestFlight. I've since deleted that version and installed the official version from the AppStore after revisiting this thread in search of answers.

I now have the full version installed on my iPhone 15 pro, and I have added my OpenAI key, but none of the models I've selected (3.5 Turbo, 4, 4 Turbo) work. My messages in the chat have a red exclamation next to them which opens an error message stating 'Load failed' when clicked. If I click 'Retry Message' the entire app crashes.


Apologies for the rough edges and bad experience - I’ve just soft launched without announcement til this post. I will have a hotfix up soon. Thanks for the report.


No stress. Best of luck!


> Do you have the right app, there's no TestFlight just App Store link

On chatonmac.com, the "Download on the App Store" button does not link the App Store for me either - I get a modal titled "Public Beta & Launch Day News" with "Join the TestFlight Beta" and "Launch Day Newsletter Signup Form".


Hello, I like your app and the ethics you push forward. Do you plan to add the ability to request DALL·E 3 images within the chat? I’ve yet to find an app that does that and lets me use my own API key.


It’s planned. This is just the v1 MVP. I’ll have a hotfix out soon. Thanks for the suggestion and context


Hey, I tried the TestFlight. What are the steps after a fresh download for hooking it up to a model?

I saw you can specify an OpenAI key, but I presume it would take Llama or something else.


This is really nice to use. Especially compared to MLC. Well done!


Thank you so much for taking the time to try it out!


would love for you to give cnvrs a shot!

- save characters (system prompt + temperature, and a name & cosmetic color)
- download & experiment with models from 1b, 3b, & 7b, and quant options q2k, q4km, q6k
- save, search, continue, & export past chats

along with smaller touches:

- custom theme colors
- haptics

and more coming soon!

https://testflight.apple.com/join/ERFxInZg


Do not download this.

I downloaded this on my 14 Pro and it completely locked up the system to the point where even the power button wouldn’t work. I couldn’t use my phone for about 10 minutes.


Quick follow-up:

I’ve just submitted a new update for review with a few small but hopefully noticeable changes, thanks to your feedback:

1. StableLM Zephyr 3b Q4_K_M is now the built-in model, replacing the Q6_K variant.

2. More aggressive RAM headroom calculation, with forced fallback to CPU rather than failing to load or crashing/hanging in such a nasty fashion.

3. New status indicator for Metal when model is loaded (filled bolt for enabled, vs slashed bolt for disabled.)

4. Metal will now also be enabled for devices with 4GB RAM or less, but only when the selected model can comfortably fit in RAM. Previously, only devices with at least 6GB ever had Metal enabled.

I really appreciate your taking the time to test — the hanging you experienced was unacceptable, and I truly am sorry for the inconvenience. I hope you’ll give it another chance once this update is live, but either way I’m grateful for your help in isolating and eliminating this issue!

Britt


I've used it for a couple weeks on my 15 Pro and I haven't experienced anything like that. (IMO it's well worth the download)

The developer is also pretty responsive and actively looking for feedback (which is why it's currently free on TestFlight)


I’m very sorry about your experience. That’s definitely not what I was aiming for, and I can imagine that was a nasty surprise. Any hang like that is unacceptable, full stop.

My understanding is Metal is currently causing hangs on devices when there is barely enough RAM to fit the model and prompt, but not quite enough to run. Will work on falling back to CPU to avoid this kind of experience much more aggressively than today.

Thank you for taking the time to both try it out and to share your experience; I will use it to ensure it’s better in the future.


Thanks for the response. Unfortunately on my device the behavior makes it impossible to report a bug using a screenshot as requested in the app. I can give you more device info if you want to narrow down the cause.


Yes of course, I would very much appreciate that, if you’d be so generous — thank you! You can email britt [at] bl3 [dot] dev


That is an iOS bug. No app should be able to do this.

So rather than reporting in the app you can report it in Feedback Assistant, if you want to.


Exactly the same here - full lock up for 2 minutes without being able to reboot even with hardware buttons.


I’m very sorry to hear you had such a poor experience as well. I’m sure it’s little consolation at this point having been inconvenienced as you have — it’s certainly not what I aim for in my work!

I’ve just submitted a new update for review with a number of small but material changes to address these issues: https://news.ycombinator.com/item?id=38920916

I hope you’ll consider giving it another shot once that’s live, and thank you for taking the time not just to test but also to report your experience!

Britt


Thanks. I did test your new version but unfortunately similar issues. App completely hung and entire OS was sluggish. iPhone 13 Pro, iOS 17.1.2. Unfortunately I won’t have time to test any more but very good luck with the project.


This crashes on almost all models for me and also locked up my phone such that only a full reboot would fix it.

