This guy generally does interesting work, but he's used an LLM to analyze the trends in a "creation science" journal over time, and I just don't think LLMs are effective for this kind of statistical task.
-
Were the results tokens emitted by the LLM or were the results generated by analyzing the model's weights?
The LLM is just going to emit text. I suspect that, in the hands of someone who knows what they're doing, it might be possible to extract interesting insights from how the model is grouping terms.
"I suspect that, in the hands of someone who knows what they're doing, it might be possible to extract interesting insights from how the model is grouping terms."
This is totally possible. But I don't think this is what that would look like?
-
This guy generally does interesting work, but he's used an LLM to analyze the trends in a "creation science" journal over time, and I just don't think LLMs are effective for this kind of statistical task.
Or have I missed something and they can count now?
Thought I'd ask before leaving a comment about the possible issue.
@futurebird He even starts with a disqualifying lie: “I analyzed”. No you (he) didn’t—he entered a parcel of text into a statistics-based machine that is not itself capable of statistical analysis.
-
Damn thing will sit there and tell you that's what it's doing.
But it can't count! It still can't count. I feel like I'm going crazy. Am I the only person who cares that the machine can't even count?
-
@futurebird If you're talking about using LLMs as a classifier for arbitrary text, I've seen YouGov do it for some polls: they ask people what they've read in the news recently, and the LLM classifies which topics were mentioned. The capability is advertised here: https://yougov.com/business/products/ai-qualitative-explorer
Also, I've seen data science articles from The Economist using basically the same idea on larger corpora of text. I think empirically the best LLMs today are very good at modelling humans, so this is ~fine?
Wouldn't you need to ask it about each article individually and track the results?
Not just give it a stack of articles and ask "how many of the articles mentioned X?"
-
Seems like the more people pride themselves on spotting nonsense, the more they seem to be advocating this shit these days. People have entered into this weird phase of mass AI hysteria and only those who don't use it at all are sitting here like... Am I crazy or are the hordes of AI enthusiasts crazy? It's gotta be one or the other.
-
I don't think this guy is an enthusiast. He's just using a tool in a way that seems reasonable and that gives the results he wants, without knowing what those results actually represent.
-
@futurebird Yeah, that would be the correct way to do it. It's still plausible that it could do the second; it's just more likely to make a mistake. That said, I think a task of this difficulty is pretty doable for current models with huge contexts (1M tokens), unlike older or cheaper models, which had severe quality drop-offs after maybe 10k tokens.
-
If it says there are 67 articles that mention topic X, but you don't know if that number is correct, it's just a guess based on context and the bulk of text (and LLMs are also bad at following commands such as "consider only these sources" ...). What is the point of saying the number?
Maybe you could ask if a topic is mentioned "frequently" or "infrequently", but beyond that I think it's deceptive and useless.
-
Sorry I thought you were referencing the original post.
-
@futurebird @Smoljaguar I’d do it with a loop. For each document, does it contain X, Y or Z? I’d end up with a table of document names and booleans.
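The loop described above might look something like the following. This is only a rough sketch: `tabulate_topics`, `count_mentions`, and `keyword_stub` are hypothetical names, the three documents are made-up toy data, and `keyword_stub` is a plain substring check standing in for whatever per-document model call you would actually make. The point is that the model is only ever asked one yes/no question about one document, and the counting is done by ordinary code.

```python
from typing import Callable, Iterable


def tabulate_topics(
    documents: dict[str, str],
    topics: Iterable[str],
    mentions_topic: Callable[[str, str], bool],
) -> dict[str, dict[str, bool]]:
    """Ask one yes/no question per (document, topic) pair and record each answer."""
    topic_list = list(topics)
    return {
        name: {topic: mentions_topic(text, topic) for topic in topic_list}
        for name, text in documents.items()
    }


def count_mentions(table: dict[str, dict[str, bool]]) -> dict[str, int]:
    """Count documents per topic with ordinary arithmetic, not model output."""
    counts: dict[str, int] = {}
    for row in table.values():
        for topic, present in row.items():
            counts[topic] = counts.get(topic, 0) + int(present)
    return counts


def keyword_stub(text: str, topic: str) -> bool:
    """Hypothetical stand-in for a per-document model call: a substring check."""
    return topic.lower() in text.lower()


# Made-up toy documents, purely for illustration.
documents = {
    "a.txt": "Notes on flood geology and radiometric dating.",
    "b.txt": "A short essay on baraminology.",
    "c.txt": "Flood geology, revisited.",
}

table = tabulate_topics(documents, ["flood geology", "baraminology"], keyword_stub)
counts = count_mentions(table)

for name, row in table.items():
    print(name, row)
print(counts)
```

Because every row of the table is kept, any individual answer can be spot-checked against the document it came from, which is not possible when the only output is a single total.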
-
I wonder if there is an API for any of the free models. Although I hate interacting with cloud APIs.
-
@futurebird @Moss
“But it can't count! It still can't count. I feel like I'm going crazy. Am I the only person who cares that the machine can't even count?”
I also feel deep incredulity towards this corporate-grade “confabulation”.
-
I mean LLMs are based on statistics, and they will produce results that look like frequency charts. But these charts only attempt to approximate the expected content. They aren't based on counting articles that meet any set of criteria.
It's... nonsense, and not even people who pride themselves on spotting nonsense seem to understand this.
@futurebird regrettably being that guy:
In context of how LLM deep research workflows are built, I do think you might need to show your work on this claim more than OP does
The model is not the only operative mechanism in such an investigation
In that approach, the model would be invoking (deterministic) tools that, among other things, could log instances of topic areas encountered within a corpus. OP says they are capturing abstracts and authors and grouping them by year. Objectively this category of work is something these tools can be built to do really well, including citations (to real, verifiable URLs). Statistical modeling tasks, including text analysis, can be offloaded to one-off scripts written and executed specifically for a requested job. Perhaps the model can’t tell you the “R” count in strawberry, but it can write Python which does quite well
Moreover, it is possible to objectively evaluate the performance of these tools for such tasks (and Anthropic, vendor of OP’s research tool, does this)
I mention all of this because I find this particular flavor of strawman quite pernicious: the limitations of the raw model architecture are entirely possible to mitigate through larger agent and tool scaffolding, and this work is constant, ongoing, and often quite effective. Critique of the technology and its vendors (essential) is meanwhile less effective when claims like this are so easily disproved by experience, usage, and public information
Here’s a bit more detail on the architecture point.
How we built our multi-agent research system
On the engineering challenges and lessons learned from building Claude's Research system
(www.anthropic.com)
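To make the "one-off script" point concrete, here is a minimal sketch of the kind of deterministic code a model could write and run instead of estimating a number itself. Everything in it is an assumption for illustration: the `records` list is made-up toy data, `topic_counts_by_year` is a hypothetical helper, and the substring match is a crude stand-in for real topic detection. It is not a description of how any particular research tool works internally.

```python
from collections import Counter

# Counting characters is trivial for ordinary code.
r_count = "strawberry".count("r")

# Made-up toy records, purely for illustration.
records = [
    {"year": 1990, "abstract": "Flood geology and the fossil record"},
    {"year": 1990, "abstract": "Radiometric dating concerns"},
    {"year": 2005, "abstract": "Flood geology revisited"},
    {"year": 2005, "abstract": "Flood geology and sediment layers"},
]


def topic_counts_by_year(records: list[dict], topic: str) -> dict[int, int]:
    """Count abstracts per year that contain the topic (case-insensitive substring)."""
    counts: Counter = Counter()
    for record in records:
        if topic.lower() in record["abstract"].lower():
            counts[record["year"]] += 1
    return dict(sorted(counts.items()))


by_year = topic_counts_by_year(records, "flood geology")
print(r_count)
print(by_year)
```

Whether a given tool actually ran a script like this for a given answer is a separate question, and one that can only be settled by looking at what it executed.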
-
Is that what the guy in the video is doing?
-
@futurebird according to what he describes in the methods section of the video, he is doing an entirely plausible research task with a tool well suited to it, yes
The post I linked describes how it works
-
OK, but he's saying things about it counting articles (frequency), and when I used the same tool it could not do that accurately. It couldn't even follow a command to restrict the dataset. It does not sound like he used some kind of API to make this kind of task possible.
-
It’s a shame that it lists summarisation as something LLMs are good at, when all of the studies that measure this show the opposite. LLMs are good at turning text into less text, but summarisation is the process of extracting the key points from text. LLMs will extract things that are shaped in the same way as a statistically large number of key points in the training set but they don’t understand either the text of the document or your context for requesting a summary and so are very likely to discard the thing that you think is most important. They also have a habit of inverting the meaning of sentences when shrinking them.
-
@david_chisnall @dahukanna @Moss
Why do I have to write the software guide for Google and Sora?