
I have been assigned to develop a company-internal chatbot that accesses confidential documents, and I am having a really hard time communicating this problem to executives:

As long as not ALL the data the agent has access to is checked against the rights of the current user placing the request, there WILL be ways to leak data. This means Vector databases, Search Indexes or fancy "AI Search Databases" would be required on a per-user basis or would have to track the access rights along with the content, which is infeasible and does not scale.

And as access rights are complex and can change at any given moment, that would still be prone to race conditions.



> This means Vector databases, Search Indexes or fancy "AI Search Databases" would be required on a per-user basis or would have to track the access rights along with the content, which is infeasible and does not scale.

I don't understand why you think tracking user access rights would be infeasible and would not scale. There is a query. You search for matching documents in your vector database / index. Once you have found the potentially relevant documents, you check which ones the current user can access. You only pass the ones the user is allowed to see over to the LLM.

This is very similar to how banks provide phone-based services. The operator on the other side of the line can only see your account details once you have authenticated yourself. They can't accidentally tell you someone else's account balance, because they themselves don't have access to it unless they have typed in the information you provide to authenticate yourself. You can't trick the operator into giving you someone else's account balance, because they can't see anyone's balance without that authentication.
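
As a rough sketch of that flow (vector_search and user_can_access here are hypothetical stand-ins for your index and your authorization check):

    def retrieve_for_user(query: str, user_id: str, k: int = 10) -> list:
        candidates = vector_search(query, top_k=1000)                     # relevance only
        allowed = [d for d in candidates if user_can_access(user_id, d)]  # authz post-filter
        return allowed[:k]                                                # only these reach the LLM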


Let's say you have 100000 documents in your index that match your query, but the user has access to only 10 of them:

A basic implementation will return the top, let's say 1000, documents and then do the more expensive access check on each of them. Most of the time, you've now eliminated all of your search results.

Your search must be access-aware to do a reasonable job of pre-filtering the content to documents the user has access to, at which point you can then apply post-filtering with the "100% sure" access check.


Yes. But this is still an incredibly well-known and solved problem. As an example - Google's internal structured search engines did this decades ago, at scale.


Which solutions are you referring to? With access that is highly diverse and changing, this is still an unsolved problem to my knowledge.


Probably Google Zanzibar (and the various non-Google systems that were created as a result of the paper describing Zanzibar).


Just use a database that supports both filtering and vector search, such as Postgres with pgvector (or any other; I think they're all adding vector search nowadays).
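
For example, a minimal sketch with psycopg2 and pgvector - the table and column names are made up:

    import psycopg2  # conn = psycopg2.connect(...) elsewhere

    def search_for_user(conn, query_embedding, user_groups, limit=20):
        # Assumes a table chunks(content text, allowed_groups text[], embedding vector)
        vec = "[" + ",".join(map(str, query_embedding)) + "]"
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT content
                FROM chunks
                WHERE allowed_groups && %s           -- access filter first
                ORDER BY embedding <-> %s::vector    -- then nearest-neighbour ranking
                LIMIT %s
                """,
                (user_groups, vec, limit),
            )
            return [row[0] for row in cur.fetchall()]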


Agree...as simple as:

    @pxt.query
    def search_documents(query_text: str, user_id: str):
        sim = chunks.text.similarity(query_text)
        return (
            chunks.where(
                (chunks.user_id == user_id)        # Metadata filtering
                & (sim > 0.5)                      # Filter by similarity threshold
                & (pxt_str.len(chunks.text) > 30)  # Additional filter/transformation
            )
            .order_by(sim, asc=False)
            .select(
                chunks.text,
                source_doc=chunks.document,  # Ref to the original document
                sim=sim,
                title=chunks.title,
                heading=chunks.heading,
                page_number=chunks.page
            )
            .limit(20)
        )

For instance in https://github.com/pixeltable/pixeltable


The thing about a user needing access to only 10 documents is that creating a new index from scratch on those ten documents takes basically zero time.

Vector databases intended for this purpose filter this way by default for exactly this reason. It doesn't matter how many documents are in the master index; it could be 100,000 or 100,000,000. Once you filter down to the 10 that your user is allowed to see, it takes the same tenth of a second or whatever to whip up a new bespoke index just for them for this query.

Pre-search filtering is only a problem when your filter captures a large portion of the original corpus, which is rare. How often are you querying "all documents that Joe Schmoe isn't allowed to view"?


If you can move your access check to the DB layer, you skip a lot of this trouble.

Index your ACLs, index your users, index your docs. Your database can handle it.


Apache Accumulo solved access-aware querying a while ago.


"Fun" Fact: ServiceNow simply passes this problem on to its users.

I've seen a list of what was supposed to be 20 items of something; it only showed 2, plus a comment "18 results were omitted due to insufficient permissions".

(ServiceNow has at least three different ways to do permissions; I don't know if this applies to all of them.)


I'm not sure if enumerating the hidden results is a great idea :0


At the least, it's a terrible user experience to have to click the "more" button several times to see the number of items you actually wanted to see.

But yes, one could probably also construct a series of queries that reveal properties of hidden objects.


> Let's say you have 100000 documents in your index that match your query

If the docs were indexed by groups/roles and you had some form of RBAC then this wouldn't happen.


If you take this approach, you have to reindex when groups/roles change - not always a feasible choice.


You only have to update the metadata, not do a full reindex.


You'd have to reindex the metadata (roles access), which may be substantial if you have a complex enough schema with enough users/roles.


> You'd have to reindex the metadata (roles access), which may be substantial if you have a complex enough schema with enough users/roles.

Right, but compare this to the original proposal:

> A basic implementation will return the top, let's say 1000, documents and then do the more expensive access check on each of them

Using an index is much better than that.

And it should be possible to update the index without a substantial cost, since most of the 100000 documents likely aren't changing their role access very often. You only have to reindex a document's metadata when that changes.

This is also far less costly than updating the actual content index (the vector embeddings) when the document content changes, which you have to do regardless of your permissions model.


I don't understand how "using an index" is a solution to this problem. If you're doing search, then you already have an index.

If you use your index to get search results, then you will have a mix of roles that you then have to filter.

If you want to filter first, then you need to make a whole new search index from scratch with the documents that came out of the filter.

You can't use the same indexing information from the full corpus to search a subset, your classical search will have undefined IDF terms and your vector search will find empty clusters.

If you want quality search results and a filter, you have to commit to reindexing your data live at query time after the filter step and before the search step.

I don't think Elastic supports this (last time I used it, it was being managed in a bizarre way, so I may be wrong). Azure AI Search does this by default. I don't know about others.


> I don't understand how "using an index" is a solution to this problem. If you're doing search, then you already have an index

It's a separate index.

You store document access rules in the metadata. These metadata fields can be indexed and then used as a pre-filter before the vector search.

> I don't think Elastic supports this

https://www.elastic.co/docs/solutions/search/vector/knn#knn-...
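
For instance, a minimal sketch with the 8.x Python client (index and field names are assumptions); the ACL metadata is applied as a filter inside the kNN clause, so only permitted documents are candidates:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def knn_search_as_user(query_vector, user_groups, k=10):
        return es.search(
            index="documents",
            knn={
                "field": "embedding",
                "query_vector": query_vector,
                "k": k,
                "num_candidates": 100,
                # Pre-filter: only documents visible to the user's groups are scored
                "filter": {"terms": {"allowed_groups": user_groups}},
            },
        )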


> You search for matching documents in your vector database / index. Once you have found the potentially relevant documents, you check which ones the current user can access. You only pass the ones the user is allowed to see over to the LLM.

Sometimes the list of potentially relevant documents is a leak all by itself.


But you process that list in a trusted, audited app tier, not in the client environment.


A naive approach could still leak information through side channels. E.g. if you search regularly for foobar, the answers might suddenly get slower if foobar starts appearing more often in the document base.

Depending on the context it could be relevant.


But we're talking about access control, so in this case "filtering for foobar" means "filtering for stuff I'm allowed to see", and the whole point is that you can never turn that filter off to get a point of comparison.

If Joe's search is faster than Sally's because Sally has higher permissions, that's hardly a revelation.


That's nothing specific to LLM-enhanced search features though, right? Any search feature will have that side-channel risk.


Thank you for the context; I wondered the same.

But I guess they want something like training the chatbot as an LLM once with all the confidential data - and then indeed you could never separate it again.


I'll answer this as a placeholder for all the "just do xyz" replies:

Searching the whole index and then filtering is possible, but infeasible for large indexes where a specific user only has access to a few docs. And for the diverse data sources we want to access, this would be really slow; many systems would need to be checked.

So, access rights should be part of the index. In that case, we are just storing a copy of the access rights, so this is prone to races. Besides that, we have multiple systems with different authorization systems, groups, roles, whatever. To homogenize this, we would need to store the info down to each individual user. And not all systems even support asking which users have access to resource Y; they only allow asking "does X have access to Y".


> I don't understand why you think tracking user access rights would be infeasible and would not scale.

Allow me to try to inject my understanding of how these agents work vs regular applications.

A regular SaaS will have an API endpoint that has permissions attached. Before the endpoint processes anything, the user making the request has their permissions checked against the endpoint itself. Once this check succeeds, anything that endpoint collects is considered "ok" to ship to the user.

AI Agents, instead, directly access the database, completely bypassing this layer. That means you need to embed the access permissions into the individual rows, rather than at the URL/API layer. It's much more complex as a result.

For your bank analogy: they actually work in a similar way to how I described above. Temporary access is granted to the resources but, once it's granted, any data included in those screens is assumed to be OK. They won't see something like a blank box somewhere because there's info they're not supposed to see.

DISCLAIMER: I'm making an assumption on how these AI Agents work, I could be wrong.


> AI Agents, instead, directly access the database, completely bypassing this layer.

If so, then as the wise man says: "well, there's your problem!"

I don't doubt there are implementations like that out there, but we should not judge the potential of a technology by the mistakes of the most boneheaded implementation.

Doing the same in the bank analogy would be like giving root SQL access to the phone operators and then asking them pretty please to be careful with it.


> If so, then as the wise man says: "well, there's your problem!"

Of course, I wouldn't defend this! To be clear, it's not possible to know how every AI Agent works, I just go off what I've seen when a company promises to unlock analytics insights on your data: usually by plugging directly into the Prod DB and having your data analysts complain whenever the engineers change the schema.

> we should not judge the potential of a technology by the mistakes of the most boneheaded implementation.

I agree.


Having the agent plugged into the DB doesn't mean the agent can see everything in the DB. If that plug includes an automatic "where current user has access" filter, then the agent can't know anything the user can't know.

That's what the bank agent analogy was meant to tell you. The agent has a direct line to the prod DB through their computer terminal, but every session they open is automatically constrained to the account details of the person on the phone right now, and nobody else.


> Having the agent plugged into the DB doesn't mean the agent can see everything in the DB. If that plug includes an automatic "where current user has access" filter, then the agent can't know anything the user can't know.

It depends on how it's plugged in. If you just hand it a connection and query access, then what exactly stops it? In a lot of SaaS systems, there's only the "application" user, which is restricted via queries within the API.

You can create a user in the DB per user of your application but this isn't free. Now you have the operational problem of managing your permissions, not via application logic, but subject to the rules and restrictions of your DBMS.

You can also create your own API layer on top; however, this also comes with the constraints of your API and requires adding protections to your query language.

None of this is impossible but, given what I've seen happen in the data analytics space, I can tell you that I know which option business leaders opt for.


The solution to this problem is to develop your agents to use delegation and exchange tokens for access to other services using an on-behalf-of flow. Agents never operate under their own identity, but as the user.
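
A minimal sketch of the RFC 8693 exchange (the token endpoint, client credentials, and audience below are placeholders):

    import requests

    def exchange_token(user_access_token: str) -> str:
        resp = requests.post(
            "https://idp.example.com/oauth2/token",  # hypothetical token endpoint
            data={
                "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
                "subject_token": user_access_token,             # the end user's token
                "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
                "audience": "https://search.internal.example",  # downstream service
            },
            auth=("agent-client-id", "agent-client-secret"),    # the agent's own identity
        )
        resp.raise_for_status()
        # The agent calls the downstream service with a token scoped to the user,
        # so that service enforces the user's permissions, not the agent's.
        return resp.json()["access_token"]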


> I'm making an assumption on how these AI Agents work, I could be wrong.

I don't understand the desire - borderline need - of folks on HN to just make stuff up. That is likely why you're being downvoted. I know we all love to do stuff "frOM fIRsT PRiNcIPlEs" around here but "let me just imagine how I think AI agents work then pass that off as truth" is taking it a bit far IMO.

This is the human equivalent of an AI hallucination. You are just making stuff up, passing it off as truth ("injecting your understanding"), then adding a one-line throwaway "this might be completely wrong lol" at the end.


>I don't understand the desire - borderline need - of folks on HN to just make stuff up.

Hacker News is addictive. This forum is designed to reward engagement with imaginary internet points and that operant conditioning works just as well here as everywhere else.


karma please go up


And yet if we didn't do this, HN would be almost completely silent, because 99% of commenters don't have a clue what they're talking about most of the time, and nobody would ever have a chance to learn.


I don't know how to say this less flippantly, and I honestly tried: you could have simply posted a comment phrased as a question, and 20 people would have jumped in to answer.

(To your point, >15 of them would have had different answers and the majority would have been materially wrong, but still.)


So let me be more direct. The part I'm not confident I'm correct in is this:

> AI Agents, instead, directly access the database

However, I don't think I'd be too far off the mark given many systems work like this (analytics tools typically hook into your DB, to the chagrin of many an SRE/DevOps) and it's usually marketed as the easy solution. Also, I've since read a few comments and it appears I'm pretty fucking close: the agents here read a search index, so pretty tightly hooked into a DB system.

Everything else, I know I'm right (I've built plenty of systems like this), and someone was making a point that permissions access does scale. I pointed out that it appears to scale because of the way they're designed.

I'd say most of my comment is substantively correct, with a disclaimer on an (important) point, where I'd be happy to be corrected.


The correct way to set up an analytics tool is to point it to an analytics DB that is a replica of your main DB. It's a pretty common part of an HA setup to replicate your primary to an actual hot read replica and a cold analytics store. This way the analytics tool queries your analytics store and doesn't put load on your hot primary or hot read replica.

> I'd say most of my comment is substantively correct, with a disclaimer on an (important) point, where I'd be happy to be corrected.

I read this and feel that you still want imaginary internet points for something that is, at best, directionally correct. To me it seems your desire for internet points urged you to post a statement and not a question. I imagine most of HN is just overconfident bluster built from merely directionally correct statements, which creates the cacophony of this site.


> The correct way to set up an analytics tool is to point it to an analytics DB that is a replica of your main DB. It's a pretty common part of an HA setup to replicate your primary to an actual hot read replica and a cold analytics store. This way the analytics tool queries your analytics store and doesn't put load on your hot primary or hot read replica.

That doesn’t solve the problem of changing schemas causing issues for your data team at all, something I see regularly. If you set up an AI agent the same way, you still give it full access, so you still haven’t fixed the problem at hand.

> I read this and feel that you still want imaginary internet points for something that is, at best, directionally correct.

And you’ve yet to substantiate your objection to what I posited (alongside everyone else), so instead you continue to talk about something unrelated in the hope of… what, exactly?


What you're describing is a specific case of a confused deputy problem: https://en.wikipedia.org/wiki/Confused_deputy_problem

This is captured in the OWASP LLM Top 10 "LLM02:2025 Sensitive Information Disclosure" risk: https://genai.owasp.org/llmrisk/llm022025-sensitive-informat... although in some cases the "LLM06:2025 Excessive Agency" risk is also applicable.

I believe that some enterprise RAG solutions create a per user index to solve this problem when there are lots of complex ACLs involved. How vendors manage this problem is an important question to ask when analyzing RAG solutions.

At my current company, at least, we call this "権限混同" in Japanese - literally "authorization confusion" - which I think is a more fun name.


Exactly. We often end up doing 'direct' retrieval (e.g. DB query gen) to skip the time suck, costs, and insecurity of vector RAG, and per-user indexing for the same. Agentic reasoning loops mean this can be better quality and faster anyway.

Sometimes hard to avoid though, like our firehose analyzers :(


> I am having a really hard time communicating this problem to executives

When you hit such a wall, you might not be failing to communicate, nor them failing to understand. In reality, said executives have probably chosen to ignore the issue, but also don't want to take accountability for the eventual leaks. So "not understanding" is the easiest way to blame the engineers later.


It doesn't even need to be blaming the engineers in this case, they can blame "the AI" and most people will accept that and let whatever incident happened slide. If somebody questions the wisdom of putting AI in such a position, they can be dismissed as not appreciating new technology (even though their concern is valid.)


Yeah, it is usually not about blaming the engineers in my experience. It is so they can make a decision they want to make without having to think too hard or take any accountability. If nobody knew at the time that it was bad, everyone can just act surprised, call it an accident, and go on with their lives making similar uninformed decisions.

In their dream world the engineers would not know about it either.

Edit: Maybe we should call this style vibe management. :D


"the AI did it" is going to be the new "somebody hacked my facebook account"

I wish I had a way of ensuring culpability remains with the human who published the text, regardless of who/what authored it.


If you're in a regulated field like law or medicine and you fuck up signing some AI slop with your name, you should lose your license at the very least.

Tools are fine to use; personal responsibility is still required. Companies already fuck this up too much.


I think it needs to be a cultural expectation. I don't know how we get there, though.


Yep. AI is wonderful for IP laundering and accountability laundering (is this even a term? It is now!)


Worse, they can be dismissed with an abstract "AI is dangerous" and used to justify funnelling money to the various AI safety charlatans.


In this case it looks like the executives should fire the OP and hire the 2nd poster who came up with a solution. C'mon lazy executives.


> I am having a really hard time communicating this problem to executives

CC'ing Legal/Compliance could do wonders for their capacity to understand the problem. The caveat, of course, is that the execs might be pissed off that some peon is placing roadblocks in the way of their buzzword-happy plan.


That would surely be a possible way, but I don't want to block anything, I just want reasonable expectations and a basic understanding of the problem on all sides.


If you start CC'ing legal or compliance on such issues, you may very well need a plan B.


This is correct. People on top are very much ego driven and don't forgive those who say no to them or make them look bad.


"would be required on a per user basis or track the access rights along with the content, which is infeasible and does not scale"

Citation needed.

Most enterprise (homegrown or not) search engine products have to do this, and have been able to do it effectively at scale, for decades at this point.

This is a very well known and well-solved problem, and the solutions are very directly applicable to the products you list.

It is, as they say, a simple matter of implementation - if they don't offer it, it's because they haven't had the engineering time and/or customer need to do it.

Not because it doesn't scale.


If you're stringing together a bunch of MCPs, you probably also have to string together a bunch of authorization mechanisms. Try having your search engine confirm, live, each person's access to each possible row.

It's absolutely a hard problem, and it isn't well solved.


Yes, if you try to string together 30 systems with no controls and implement controls at the end, it can be hard and slow - "this method I designed to not work doesn't work" is not very surprising.

But the reply I made was to "This means Vector databases, Search Indexes or fancy "AI Search Databases" would be required on a per-user basis or would have to track the access rights along with the content, which is infeasible and does not scale."

I.e., information retrieval.

Access control in information retrieval is very well studied.

Making search engines, etc. that effectively confirm user access to each possible record is feasible and common (they don't do it exactly this way, but the result is the same), and scalable.

Hell, we even know how to do private information retrieval with access control in scalable ways.

PIR = the server does not know what the query was, or the result was, but still retrieves the result.

So we know how to make it so that not only does the server not know what was queried or retrieved by a user, but each querying user can still only access records they are allowed to.

The overhead of this, which is much harder than non-private information retrieval with access control, is only 2-3x in computation. See, e.g., https://dspace.mit.edu/handle/1721.1/151392 for one example of such a system. There are others.

So even if your 2ms retrieval latency was all CPU and 0 I/O, it would only become 4-6ms due to this.

If you remove the PIR part, as I said, it's much easier, and the overhead is much, much less, since it doesn't involve tons and tons of computationally expensive encryption primitives (though some schemes still involve some).


I don't know the details, but I know if I give our enterprise search engine/api a user's token it only returns documents they are allowed to access.


Do you know of papers or technical reports that demonstrate the scalability of authorization-preserving search indexes?

I don't doubt they exist but what we hear about are the opposite cases, where this was obviously not implemented and sensitive data was leaked.


I believe most vector databases allow you to annotate vectors with additional metadata. Why not simply add as metadata the list of principals (roles/groups) who have access to the information (e.g. HR, executives)? Then, when a user makes a request to the chatbot, you expand the user's identity to their principals (e.g. HR) and use those as implicit filtering criteria when finding the closest vectors in the database.

In this way you exclude up-front the documents that the current user cannot see.

Of course, this requires you to update the vector metadata any time the permissions change at the document level (e.g. a document originally visible only to HR is now also visible to executives -> you need to add the principal "executives" to the metadata of that document's vector in your vector database).
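
A rough sketch with Qdrant, assuming a "principals" payload field (any vector DB with payload filtering works similarly; collection and field names are made up):

    from qdrant_client import QdrantClient
    from qdrant_client.models import Filter, FieldCondition, MatchAny

    client = QdrantClient(url="http://localhost:6333")

    def search_as_user(query_vector, user_principals, top_k=20):
        # Only vectors whose "principals" payload overlaps the caller's roles/groups
        # are candidates for the nearest-neighbour search.
        return client.search(
            collection_name="documents",
            query_vector=query_vector,
            query_filter=Filter(
                must=[FieldCondition(key="principals", match=MatchAny(any=user_principals))]
            ),
            limit=top_k,
        )

    def update_document_principals(doc_id, principals):
        # Called whenever document-level permissions change; only the payload is
        # touched, the embedding itself stays as-is.
        client.set_payload(
            collection_name="documents",
            payload={"principals": principals},
            points=[doc_id],
        )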


This is the correct answer. You do a pre-filter on a permissions-correlated field like this and post-filter the results for the deeper perms checks.


I am in control of the vector database and the search index. I have no control over the different accessed data sources, which don't even allow querying access rights per resource (they just allow can_access checks for a given user).


>communicating this problem to executives

I don't just mean this as lazy cynicism; executives don't really want to understand things. It doesn't suit their goals. They're not really in the business of strictly understanding things. They're in the business of "achieving success." And, in their world, a lot of success is really just the perception of success. Success and the perception of success are pretty interchangeable in their eyes, and they often feel that a lot of engineering concerns should really be dismissed unless those concerns are truly catastrophic.


> they often feel that a lot of engineering concerns should really be dismissed unless those concerns are truly catastrophic.

Grizzled sysadmin here, and this is accurate. Classic case of "Hey boss, I need budget for server replacements, this hardware is going to fail." Declined. A few months later, it fails. Boss: "Why did you allow this to happen, what am I even paying you for?"


One typical way to resolve this is to use Voluntary Oblivious Compliance (VOC). In this capability-based pattern, every storage service could provide an opaque handle to a user that represents their authorization, and it can be used to restrict which documents indexing is done on.

http://wiki.erights.org/wiki/Walnut/Secure_Distributed_Compu...

http://www.skyhunter.com/marcs/ewalnut.html#proofOfPurchase

If the opaque handle is part of the Membrane pattern, you can even avoid most race conditions, because even during the indexing, the capabilities can be used to access documents and that removes the possibility of a TOCTOU race.

http://wiki.erights.org/wiki/Walnut/Secure_Distributed_Compu...


Depending on how you construct this, it may be a lot harder than you are saying.

If your approach is to build a chatbot that scans the documents, you can enforce access on a per-session basis by limiting the documents available to the application.

But if the approach is to train a neural net on this body of sensitive documents then you have a bigger problem. Actually two. The first is that the access control requirements have to be accounted for in the scoring function, which amounts to building a different engine for each user context. Though I suppose you could think of it as a composed neural net whose first net maps the input onto the [0,1] range on a per-user basis using the access control rules, and whose second net takes those results and runs them through additional layers.

The second is that the trailing neural net won't converge the same way for different inputs, and (so far as I'm aware) there isn't any theory for how to propagate access restrictions across a neural net.

Before inventing an old wheel, does anybody know of work on this in the research literature?


I listened to a podcast with someone at Glean about this. They do that, but she pointed out it is not enough, as permissions are often wrong, and a good AI search can find you a bunch of documents with salary information that you have permission to see but should not.


Two points/questions:

1. Why is tracking access rights "on a per-user basis or [...] along with the content" not feasible? A few mentions: Google Zanzibar (+ Ory Keto as an OSS implementation) makes authz for content orthogonal to apps (i.e. it is possible to have it in one place, such that both Jira and a Jira MCP server can use the same API to check authz - possible to have 100% faithful authz logic in the MCP server); Eclipse Biscuit (as far as I understand, this is Dassault's attempt to make JWTs on steroids by adding Datalog and attenuation to the tokens, going in the Zanzibar direction but not requiring a network call for every single check); Apache Accumulo (a DBMS with cell-level security); and others. The way I see it, the tech is there, but so far not enough attention has been put on the problem of high-fidelity authz throughout the enterprise at a granular level.

2. What is the scale needed? Enterprises with more than 10,000 employees are quite rare, and many individual internal IT systems even in large companies have fewer than 100 regular users. At these levels of scale, a lot more approaches are feasible that would not be considered possible at Google scale (i.e. more expensive algorithms w.r.t. big-O are viable).


Because the problem is not "get a list of what the user can access" but "the AI that was trained on a dataset must not leak it to a user who doesn't have access to it".

There is no feasible way to track that during training (at least yet), so the only current solution would be to train the AI agent only on data the user can access, and that is costly.


Who said it must be done during training? Most of the enterprise data is accessed after training - via RAG or MCP tool calls. I can see how the techniques I mentioned above could be applied during RAG (in vector stores adopting Apache Accumulo ideas) or in MCP servers (MCP OAuth + RFC 8693 OAuth 2.0 Token Exchange + Zanzibar/Biscuit for faithfully replicating the authz constraints of the systems the data is being retrieved from).


As I understand it, there's no real way to enforce access rights inside an LLM. If the bot has access to some data, and you have access to the bot, you can potentially trick it into coughing up the data regardless of whether you're supposed to see that info or not.


MCP tools with OAuth support + RFC 8693 OAuth 2.0 Token Exchange (aka the OAuth 2.0 on-behalf-of flow in Azure Entra - though I don't think MCP 2025-06-18 accounts for RFC 8693) could be used to limit the MCP bot's responses to what the current user is authorized to see.


If you have a field like an acl_id or some other context information on the data that is closely linked to a user's files, you can pass the user's set of those field values to the vector database to pre-filter the results, then do a permissions post-check on a fairly relevant set.

The vector DB definitely has to do some heavy lifting intersecting the (say) acl_id index with the nearest-neighbour search, but they do support it.


I know ethics aren't high up on the list of things we're taught about in tech, so I'd like to take a moment and point out that it's your moral responsibility to remove yourself from a project like this (or the company doing it.)


No need for a per-user database; simply attach ACLs to your vector DB (in my case I use Postgres, with RLS for example, or a baked ACL policy list if you're using OpenSearch).
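
A minimal sketch of the Postgres RLS variant (table, column, and session-setting names are made up; it assumes the app connects as a role that is subject to the policy, and pgvector for the similarity ordering):

    import psycopg2  # conn = psycopg2.connect(...) elsewhere

    SETUP_SQL = """
    ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
    CREATE POLICY chunks_read ON chunks FOR SELECT
      USING (allowed_group = ANY (string_to_array(current_setting('app.user_groups', true), ',')));
    """

    def query_as_user(conn, user_groups, query_embedding):
        vec = "[" + ",".join(map(str, query_embedding)) + "]"
        with conn.cursor() as cur:
            # Scope this session to the caller; every SELECT is now filtered by the policy.
            cur.execute("SELECT set_config('app.user_groups', %s, false)",
                        (",".join(user_groups),))
            cur.execute(
                "SELECT content FROM chunks ORDER BY embedding <-> %s::vector LIMIT 20",
                (vec,),
            )
            return [row[0] for row in cur.fetchall()]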


What do you mean a problem? It's an AI man. Just ask it what to do man. It's thinking, it's really really big thinking. Big thing which does all big thinking. The multi-modal reasoning deep big thinking bro. Security, permissions, thats so important for you?? We have AI. It does thinking. What else do you need?? Because two brains are better than one. Your backlog doesn’t stand a chance. Get speed when you need it. Depth when you don’t. Make one change. Copilot handles the rest. It's your code’s guardian angel. It's AI bro. AGI is coming tomorrow. Delegate like a boss. Access rights and all the complex things can wait.


> As long as not ALL the data the agent has access to is checked against the rights of the current user placing the request, there WILL be ways to leak data.

This is the way. This is also a solved problem. We solved it for desktop, web, mobile. Chatbots are just another untrusted frontend and should follow the same patterns to mitigate risks, i.e. do not trust inputs, and use the same auth patterns you would for anything else (OAuth, etc.).

It is solved and not new.


True, per-user doesn't scale.

Knowledge should be properly grouped, and rights on the database, documents, and chatbot should be managed by groups. For instance, a specific user can use the Engineering chatbot but not the Finance one. If you fail to define these groups, it feels like you don't have a solid strategy. In the end, if that's what they want, let them experience open knowledge.


As if knowledge was ever that clear cut. Sometimes you need a cross-department insight, some data points from finance may not be confidential, some engineering content may be relevant to sales support… there are endless reasons why neat little compartments like this don't work in reality.


Yeah. If you have knowledge stored in a structured form like that, you don't need an AI...


If the organisation is so bad that finance docs are mixed with engineering docs, how do you even onboard people? Do you manually go through every single doc and decide whether the newcomer can or can't access it?

You should see our Engineering knowledge base before saying an AI would be useless.



