Hacker News | GraffitiTim's comments

Exa (YC S21) is trying to solve this problem by re-indexing the web in an LLM-friendly way.


2021? How are they doing?


This is why we need Freshpaint (YC S19) for analytics and other services for healthcare companies. A primary focus on regulatory compliance, privacy, security.


These aren’t the key problem.

The key problem is that you’ve lost the trust of the authors you want to attract. It’s no longer a place I can post and know that my content will be cleanly accessible to readers. I now think you’ll pepper it with pop ups and account demands.

It went from being a minimalist and trusted place to post, to now a feeling of feeding my own content into someone else’s machine and losing control of it.


Also: if you've been the new CEO since last July and haven't figured this out, you'll fail to save Medium.

You mentioned paying attention to the sentiment and linked to the HN survey where people explained why they don't like Medium. The top comment was someone explaining this exact reason.

It's highly admirable that you are on here trying to listen, communicating issues transparently, and working to fix problems. But I think you need to listen even more deeply.

Unfortunately, this will push you into the depths of the business model, which you won't want to change but which is the fundamental reason for Medium's eventual failure.

Right now is the moment to save it, as you read this note!


For me, it’s more like: if I write on Medium, I know my stuff is being put right next to utter shit. I take writing seriously, and it honestly looks bad to be on the platform, because almost everything you see on there is so bad.


I think that’s a pretty good summary. If I write something for the public to enjoy, I’d better do it on my own site. They may never actually find it, but at least they’d be able to access it when they do.


I mean the key problem is surely that most readers don't want to pay to read blogs, so you can't really fund a large business from it. Medium has 180 employees apparently, which tbf is less than I expected. But still, it's a very simple site. Really it should be "finished" and running on like 50 employees at the most. You could then probably fund it by relatively reasonable advertising instead of paywalls. Or potentially charge authors for features like image hosting.


AI will also be able to fill in dialog, plot points, etc.


This was a very nice thing to say.


They super earned it. From day one, everyone showed up with a level of drive and determination I haven't seen elsewhere.

My name is on The Pile paper https://arxiv.org/abs/2101.00027 but I didn't do anything except make the books3 dataset. Stella, Leo, and everyone else did the hard work. You know, the work that's "actually useful to the scientific community." I didn't even help them hunt for typos, even though Stella asked me to. I was just like, sorry, no time, I have to focus on my own research.

Imagine saying "nah" to helping shape one of the most important open source AI research projects of the coming years. Training data quality is becoming more and more of a focus.

Lemme tell you a quick story.

When https://venturebeat.com/2021/06/09/eleutherai-claims-new-nlp... came out, this quote caught my eye:

> But EleutherAI claims to have performed “extensive bias analysis” on The Pile and made “tough editorial decisions” to exclude datasets they felt were “unacceptably negatively biased” toward certain groups or views.

When I read this, I felt astonished that Eleuther was yet again trying to pose as the cool super-progressive AI lab. To my knowledge, no such thing ever happened. And I was involved with The Pile back when it was just me and Leo memeing in Discord DMs about how the world needed some quality training data once and for all.

I went to Stella in DMs (you should follow her too! https://twitter.com/BlancheMinerva/status/139408950872390042...) and was like, what the hell? I don't understand how this could possibly be true. What are these supposed "tough editorial decisions"?

Stella calmly explained to me that the US Congressional Record had been considered and rejected for inclusion in The Pile. I thought "Big deal, who the hell cares?" while saying "Okay, but I don't know what that is."

It’s a written record of all statements made in the US legislature. It was also somewhere between 1GB and 15GB, which would have been a significant portion of The Pile's total size.

I'm going to quote from her private DMs with me, which I haven't asked for permission to do. So this is technically another bad move by me. But she put it so perfectly, I was stunned:

> For half the history of the US, black people were slaves. For something like 75% of it, black people didn’t have the right to vote. If a modern reader didn’t think there was a high proportion of extremely racist content, that would primarily be an indictment of modern people lol.

> The reason we first looked at it was that we included a similar document for the EU Parliament

It took me a few minutes to come to my senses, but I finally realized:

(a) this dataset likely contained a huge proportion of content that, politics aside, would be a Very Bad Idea to include in your ML models by default;

(b) Eleuther had just been trying to do good work this whole time

So you know, when you're in that situation, you can choose to either keep believing your own false ideas, or you can pay attention to empirical evidence and change your behavior. And empirically, I had been a massive asshole to everyone since pretty much the beginning. The only thing I helped with was books3 and arranging The Eye to get them some reliable hosting. (Shoutout to The Eye, by the way. Help 'em out if you can: https://the-eye.eu/public/AI/)

And there's my name, right there on the paper.

It's even worse than I described. I put the paper in jeopardy, because they were submitting it to a conference with strict anonymity rules. I had no idea about it (no one told me). I ended up so happy to see my name on a real arxiv paper that I tweeted out some self-congratulatory bullshit, and quote-tweeted something linking to The Pile. It was a few days into the anonymity period, but nonetheless, it was a violation of the anonymity rules. A lot of people saw that tweet, and the whole point of the rules is to ensure that people don't get unfair advantages by advertising on social media.

When they came to me in DMs apologizing profusely for not talking with me about it, and asking me to delete the tweet, I basically told them to go shove a spoon up their.... because I didn't agree to any rules, and the idea that The Pile should go radio silent for five months on social media struck me as completely crazy.

In hindsight, I was... just awful. So I mean, me posting this is like, the absolute minimum I can do. They've been the ones working for like a year to make all of this happen. Ended up feeling like a fraud, since everyone thinks highly of my ML work, and here I'd been nothing but problematic for a group of people who are just trying to ship good scientific work.

Fast forward to today, and the results are clear. Go help Eleuther: https://www.eleuther.ai/ They're cool, and you'll get a shot at changing the world. I'm not sure you even have to be particularly skilled; some of the most valuable work was done by people who just showed up and started doing things, e.g. making the website look a little nicer, or making a cool logo.


This is probably one of the best apologies I've ever read.


The quote from the direct message made me respect Eleuther much more. Largely because I had no idea such ethical considerations were even being made.

Understanding the biases of these datasets is clearly more nuanced than I realized and I'm glad Stella had a nuanced understanding here.


Exactly. This was the type of mistake that OpenAI could easily have made. I could see myself including this historical dataset without giving it a second thought. After all, the more data, the better, right?

One of The Pile's goals was to point out how tricky that can be. We've all seen how effortlessly Copilot spits out GPL code by rote; one wrong prompt would be all it takes to start spewing a lot of things that no one wants to hear, if you have the wrong sort of data.

When you train with The Pile, you know exactly what you're getting, because you can take whatever parts you want and ignore the rest. It's a modular dataset. But defaults still matter -- by default, everyone will train on everything. Maybe OpenAI trained on the wrong thing, and maybe that's why they're forcing everyone to use their filters now. Whereas people can "just go train on everything in The Pile" and not have to worry.
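A minimal sketch of what "take whatever parts you want and ignore the rest" can look like in practice. It assumes each record is a JSON line with a `text` field and a `meta.pile_set_name` label naming its component; the sample records and the `keep_components` helper are hypothetical and not part of any official tooling.

```python
import json

def keep_components(lines, wanted):
    """Yield the text of every record whose component label is in `wanted`."""
    wanted = set(wanted)
    for line in lines:
        record = json.loads(line)
        # Assumed layout: {"text": ..., "meta": {"pile_set_name": ...}}
        if record.get("meta", {}).get("pile_set_name") in wanted:
            yield record["text"]

# Hypothetical sample records, standing in for lines of a dataset shard.
sample = [
    json.dumps({"text": "def f(): pass", "meta": {"pile_set_name": "Github"}}),
    json.dumps({"text": "Chapter One", "meta": {"pile_set_name": "Books3"}}),
    json.dumps({"text": "Abstract: we study", "meta": {"pile_set_name": "ArXiv"}}),
]

selected = list(keep_components(sample, ["Github", "ArXiv"]))
print(selected)
```

The point about defaults still holds: a filter like this only helps the people who bother to write one.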

(Once upon a time, the plan was to include a dump of Literotica in The Pile, which you can still find here: https://the-eye.eu/public/AI/pile_preliminary_components/ I argued heavily in favor of this, and thought it was totally lame when they decided to drop it.)

In hindsight, that was a close call. AI Dungeon proves that it's easy to carelessly include things that can bite you later: https://gitgud.io/AuroraPurgatio/aurorapurgatio#aurorapurgat...

Maybe some people want their models to include that sort of thing, but it shouldn't be the default. People shouldn't have to worry that the defaults will be "Whoa, I only wanted to make a Q&A system for my business; why is it reciting love poems?"

Stella saw that, I think. I didn't.


So what’s the rationale for including so much “romance” literature in The Pile? My innocent “walk in the park” prompt turned extremely graphic for no apparent reason.


Unfortunately, that's probably my fault.

I foolishly had a big head, and felt like it was so clear what needed to happen: we needed a dataset of "every book ever."

books3, one of the largest components of The Pile, is 196,640 books. https://twitter.com/theshawwn/status/1320282149329784833?lan...

I'm proud I did that. And I'm also horrified that my perspective was so incredibly off-base. I get it now. I was blinded by my own thick skull.

The sheer quantity of knowledge in books3 is almost unfathomable. I find it hard to think too much about it, because you end up concluding that AIs are the only entities on earth that stand a chance of absorbing this much knowledge.

I just pulled up the books3 index of "2" -- i.e. all books starting with the number 2: https://gist.github.com/shawwn/85cbaf53cb6bb57c49f1688e70532...

That's the truncated file. If you go to the full file, then command-F for "sex", there are 93 hits.

93 sex books. In just the "2" section.

All the sections are here: http://the-eye.eu/public/Books/Bibliotik/

Like Hattori Hanzo, I feel I can say with no ego that books3 is my finest work. https://www.youtube.com/watch?v=az2dSNXRKOc&ab_channel=kurts...

You would not believe how hard it is to get 193 thousand books converted into perfectly-readable markdown. Even the software books have perfect formatting -- every table, every code snippet, I annihilated every corner case I could find. Because it needed to be perfect for humans, to have any chance of being perfect for AI.

But I was a fool. My ego blinded me to the fact that it's a bad idea to do what I truly believed was in everyone's best interest: that "because any human could read any of those books, AI should know all of those books."

It's not a human. It's a markov chain. Having it autocomplete sex books is a bad idea for business purposes. I wanted The Pile to be business-grade. My work here has endangered that goal.

And I don't know how it could have ended up any differently. Because I don't know how to sort 193 thousand books into reasonable selections that you may or may not want to exclude. Our goal with The Pile was to let you decide. Who among us would dare feel that they could judge 193 thousand books from their titles alone?

It's a job for filtering and heuristics and analysis and hard work -- none of which I did. I spent around three days turning Aaron Swartz' html2text library into the best damn "epub to training data converter" ever made. Yet my accomplishments feel so hollow, for the reasons you observed here.

Stella and Leo put so much more thought and care into their contributions. I try to take solace in the fact that The Pile lets you pick and choose which portions of training data you want to use: https://github.com/EleutherAI/the-pile

But of course, the irony is, even though The Pile is so flexible and modular, most people will just use the defaults. And by default, The Pile includes.... most of humanity's knowledge. A gargantuan pile of books. So many books that you could fill an entire neighborhood with nothing but books, and you'd still have a hundred thousand books left over.

I don't know how to feel about all that. I wanted to make an impact. I guess I did. Time will tell whether it's a net gain.

Luckily, OpenAI made these same mistakes. That's the grain of truth I cling to. They almost certainly made these exact same mistakes, because their goal was to make a million dollars a year (which they achieved), and to do so as quickly as possible.

Now they have to be super paranoid with their filters, and GPT-J is at least slightly less shocking than GPT-3 thanks to everyone not-me who worked on The Pile.


> > EleutherAI claims to have performed “extensive bias analysis” on The Pile and made “tough editorial decisions” to exclude datasets they felt were “unacceptably negatively biased” toward certain groups or views.

> When I read this, I felt astonished that Eleuther was yet again trying to pose as the cool super-progressive AI lab.

So they traded biases inherent in the dataset for intentionally introduced biases. Does not sound super progressive to me, to be quite honest.

Focus on your research, do not try to be the morality judge and jury…


My memory is there were some stores that had "customer reviews" before Amazon, but they were all screened by the company before being published. So every product would only have 5 star reviews.

When Amazon came out with real reviews, it seemed crazy to many at the time -- like why would you want to share bad things about your products on your own website? I thought it was awesome.


OK, so now this hinges on the idea that such stores had an online e-commerce-y presence before amzn. Again, I'm not ruling it out, but I don't remember us having any real models for this.

Amazon's customer review system was created as a barrier to entry for competitors, as much as anything else.


Your memory would be more credible than mine certainly!

Perhaps I'm remembering my/people's reaction to first seeing reviews on Amazon.


My daughter (who just turned 4) seemed like she might like programming, so I started out having her "program" a stuffed walrus, by telling it whether to go forward, backward, left, or right to get to a piece of food. Her natural inclination was to point to where it should go, so I first taught her that the walrus doesn't understand pointing or the word "here", just the directions.

Then we started "programming each other" by telling each other where to go, and I introduced doing multiple steps at once (like "step forward 5 times").
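For readers who want to see the walrus game as code, here is a rough sketch. The grid, the `MOVES` table, and the `run_program` helper are all made up for illustration; the only things taken from the game are the four directions and the idea of repeating a step several times.

```python
# Each direction moves the walrus one square on a grid (x to the right, y forward).
MOVES = {
    "forward": (0, 1),
    "backward": (0, -1),
    "left": (-1, 0),
    "right": (1, 0),
}

def run_program(start, program):
    """Follow (direction, times) instructions and return the final position."""
    x, y = start
    for direction, times in program:
        dx, dy = MOVES[direction]  # the walrus only understands these four words
        x += dx * times
        y += dy * times
    return (x, y)

food = (2, 5)  # hypothetical spot where the food is
program = [("forward", 5), ("right", 2)]
position = run_program((0, 0), program)
print(position, "reached the food" if position == food else "still hungry")
```

An instruction like `("here", 1)` raises a `KeyError`, which is roughly the walrus not understanding pointing.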

My goal wasn't literally to teach her to program, but just to introduce that way of thinking, which is pretty different from how we normally think in day-to-day life.

She was excited about it, so I got her the Osmo programming kit for iPad. You program a little monster walking around, using physical, scratch-style code blocks. She's been excited about programming the monster every day, and is able to (sometimes) do some short programs of a couple blocks.

If she learns a bit more, my plan is to show her how to program a simple lego robot with scratch, like one that spins a flag when it sees something pink. I love that idea because with 2 lines of code you can make something really happen in the world, plus she'll be able to come up with new ideas for the robot on her own, and learn about the constraints, sensors, and eventually more basic programming logic.


Somewhat related: At one point I introduced my kids to what I called the number-machine-game. It goes like this:

I ask my kids to say a number, and then I do something to that number and tell them the answer.

Their job is to figure out the calculation I do.

Examples:

- I add a number: they say 3, I say 8. They say 11, I say 16

- I multiply a number.

- I multiply by something and add something else.

- etc

If you want to drive them nuts you can count the letters of the number, i.e. four = f o u r = 4, five = f i v e = also 4, ten = t e n = 3 etc :-D

(I might have gotten the idea from HN, but the above is how I taught it.)
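A small sketch of the number-machine game, for anyone who wants to play it against a computer. The "add 5" rule is read off the 3 → 8 and 11 → 16 example; the function names and the `consistent` helper are invented here, and the letter-count table only covers the numbers mentioned above.

```python
def add_five(n):
    return n + 5

def letter_count(n):
    # Only the numbers used in the examples above; a fuller game would need more words.
    words = {4: "four", 5: "five", 10: "ten"}
    return len(words[n])

def consistent(guess, machine, inputs):
    """True if the guessed rule matches the machine on every number tried so far."""
    return all(guess(n) == machine(n) for n in inputs)

print([add_five(n) for n in (3, 11)])
print([letter_count(n) for n in (4, 5, 10)])
print(consistent(lambda n: n + 5, add_five, [3, 11]))
print(consistent(lambda n: 2 * n + 2, add_five, [3, 11]))
```

The wrong guess "double it and add 2" also turns 3 into 8, so it only gets ruled out once the second number (11) is tried. That is why one example is rarely enough.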


I remember one day in primary school a teacher had us play a game where we had to propose things to take on vacation. People would call out a suggestion, and the teacher would say yes or no.

The aim of the game was to figure out the rule. She said no to 'novel' and 'money', but yes to 'book' and 'currency'.


Spoiler: Is it the number of letters? That is, only an even-length string is permitted?
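A quick way to check a guess like this against the four answers given. The teacher's actual rule isn't stated in the thread, so this sketch only shows whether a candidate rule is consistent with what was observed; `fits` and the alternative rule are made up for illustration.

```python
# The teacher's answers as reported above: True means "yes, you can take it".
observations = {"novel": False, "money": False, "book": True, "currency": True}

def even_length(word):
    return len(word) % 2 == 0

def fits(rule, observations):
    """True if the rule gives the teacher's answer for every word seen so far."""
    return all(rule(word) == allowed for word, allowed in observations.items())

print(fits(even_length, observations))
print(fits(lambda w: w.startswith("b"), observations))
```

The even-length guess fits all four words, while "starts with b" fails on 'currency'. Fitting four examples doesn't prove it is the real rule, of course.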


I've used the same before, I believe also inspired by a similar HN comment. I was delighted to find that it's actually been developed into a complete game as well[1]

However, to add my 2c to the conversation: I introduced programming to my cousins and niblings through baking! A recipe is just a program for a delicious outcome, which helps a bit with motivation while also teaching a related skill.

[1] https://www.cinqmarsmedia.com/devilscalculator/


For my daughter (she's now 7), I followed this sequence:

- Exact instructions challenge (from YouTube)

- LightBot app on Android

- Scratch with Harvey Mudd College's course on edX

Snap! has some nice features but the community aspects of Scratch are so much better that she's happy building games there.

Same as you, our goal was not to "learn programming", but just to have fun making things move with your ideas. Just creating rather than passively consuming something.

Because this "coding for kids" mania seems to have gone overboard, I collected links to all the resources I used in the form of a "syllabus" here: https://learnawesome.org/items/1c96e03a-ffff-4579-b69a-0387b...


Is this the YouTube video you mention above? https://www.youtube.com/watch?v=cDA3_5982h8


Yes. This is basically teaching how to give instructions to a robot that does exactly what you tell it to - nothing more and nothing less! Recipes are nothing but simple algorithms including branches and loops, maybe even procedures.
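To make the recipe comparison concrete, here is a toy sketch with one branch, one loop, and one procedure. The pancake recipe, the step wording, and the function names are all invented for illustration.

```python
def whisk(log, strokes):
    """A procedure: a reusable step that any recipe can call."""
    return log + [f"whisked x{strokes}"]

def make_pancakes(servings, has_buttermilk):
    log = []
    # Branch: pick an ingredient depending on what is in the fridge.
    liquid = "buttermilk" if has_buttermilk else "milk"
    log.append(f"mix flour, egg, {liquid}")
    log = whisk(log, strokes=20)
    # Loop: repeat the same step once per pancake.
    for i in range(1, servings + 1):
        log.append(f"fry pancake {i}")
    return log

steps = make_pancakes(servings=2, has_buttermilk=False)
for step in steps:
    print(step)
```

Like the robot in the video, the function does exactly what the steps say and nothing else: forget the frying loop and you get batter, not pancakes.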


I bought this for my 4yo: https://www.youtube.com/watch?v=XlnP-8SczF0

Sorry the vid is in Polish but I think you can get the idea.

It is a talking robot that does "missions" where you have to program it to go forward, backward or turn, take objects from the map, etc.


You reminded me of a robot I had as a kid in the early eighties. It had some rubber keys on the top (similar texture to the rubber buttons on a Sinclair Spectrum, but smaller). You could program it with a sequence of moves (IIRC forward, turn right, turn left, pause).

The robot you linked looks great, with the addition of the missions. But I'm not sure whether it's available in the US, so I ordered this one instead: https://www.amazon.com/Fisher-Price-Code-n-Learn-Kinderbot/d...


If you want the non-digital game for this: RoboRally https://boardgamegeek.com/boardgame/18/roborally

It's actually pretty fun, even for grown-up coders ;)


Love the idea! I'll try that with my 4yo daughter this week, I think she'd love to tell me what to do :P


The Osmo is cool, it's playful and feedback is great.


My 3 year old daughter really enjoyed this! Very nicely done.


So glad to hear! Thank you. :)


We've been using Rally for our happy hours, and internal meetings where there are more than ~5 people. It's great being able to easily hop between tables as conversation topics shift. It brings back a lot of that serendipity that we're all missing from in-person events.

In fact, I find it allows for more serendipity than a lot of in-person events. In person, once you're sitting down at a table it's socially awkward to get up and move to another table. Mostly you just talk with whoever you're sitting next to or across from. With Rally it's much more fluid. I find myself hopping around all the time.

In our internal meeting this week we split into groups of 3, and then took turns jumping on stage to report back to the group.

Disclaimer: we invested


Thanks Tim!


I imagine some kind of VR tracker for the Quest will come out eventually. Something like the Vive trackers but smaller. Agree, it would open up all kinds of interesting applications.


They could be wildly successful if, instead of $100 pucks, they could just sell $1 stickers that you could attach to whatever you want. With enough of them you could probably use them to help scan in the geometry of real-world objects.

