camwilson

As part of a test run for Australia's corporate regulator, AI was used to summarise submissions made by the public.

The trial found that AI performed worse in every metric compared with humans. Assessors suggested AI would make more work for people, not less. crikey.com.au/2024/09/03/ai-wo

Crikey · AI worse than humans in every way at summarising information, government trial finds · By Cam Wilson

@camwilson it's the conclusion that counts: “This finding also supports the view that GenAI should be positioned as a tool to augment and not replace human tasks.” Also, this is one test. My impression is that in some contexts summarising actually works quite well, for example if you need to quickly scan a large number of texts/PDFs without reading them.

@ErikJonker @camwilson if you need to quickly scan a bunch of documents and accuracy isn't necessary, I'd argue you don't in fact need to scan those documents at all.

@Beeks @ErikJonker @camwilson

Or you need that old tech that used to be called "database"...

@ErikJonker @camwilson

They pre-tested AIs, selected the one that seemed most promising, and it still scored 47% vs. the humans' 84%.

"The reviewers’ overall feedback was that they felt AI summaries may be counterproductive and create further work because of the need to fact-check and refer to original submissions which communicated the message better and more concisely.

"[...]the trial showed that a human’s ability to parse and critically analyse information is unparalleled by AI[.]"

@pjohanneson @camwilson They used Llama2-70B. Compare it on this recent leaderboard with the newest models: the score is not that bad, and with a newer model (like GPT-4o or Claude Opus) it would probably be much, much better.
I am not a blind believer in generative AI, but the difference is meaningful (34 vs 77 quality index).
artificialanalysis.ai/leaderbo

artificialanalysis.ai · LLM Leaderboard - Compare GPT-4o, Llama 3, Mistral, Gemini & other models | Artificial Analysis · Comparison and ranking the performance of over 30 AI models (LLMs) across key metrics including quality, price, performance and speed (output speed - tokens per second & latency - TTFT), context window & others.

@camwilson I take it when it comes down to the 'fast, cheap or good' AI only does one of those.

@ariaflame @camwilson
Unfortunately, good isn't on the menu.
How about mediocre?

@camwilson I just ran six of my own blogposts on misconceptions in biology and biochem through asking for summaries. The first was accurate and useful; the second one wasn't inaccurate but didn't mention the key vocabulary I was critiquing; the third said I was arguing *for* the very thing the post explicitly argued *against*; the next two were just vague tangential waffle; and the last one said the post was about a completely different aspect of the enzyme from the one the post discussed 💩

@eibhear

It's just generating a plausible continuation of the prompt, where that includes a blogpost using the words "activation energy" and "enzymes". Since the training data set contains millions of strings saying "enzymes reduce the activation energy of a reaction", it just pukes that out, even though that's the exact idea I'm critiquing. It's not summarising, it's just Markov-ing its way through the training data starting from the prompt words.

@camwilson AI doesn't summarise, it compresses text. Summarising implies an understanding of the subject; AI has no understanding of anything.

@camwilson "The most promising model, Meta’s open source model Llama2-70B, was prompted to summarise the submissions" 🤨

It is almost like it matters which model one chooses for the task at hand.

@camwilson
"The trial found that AI performed worse in every metric compared with humans."
Easily solved! Just have AI evaluate the AI summaries 😁

@camwilson which is literally the predictable outcome!

@camwilson

Homer: Kids, there's three ways to do things. The right way, the wrong way, and the Max Power way.
Bart: Isn't that just the wrong way?
Homer: Yes, but faster!

@camwilson This is just silly. I can paste a 10 page pdf into AI and get a 1 page summary that's enough to decide if I need to read all 10 pages or not.

Or... I suppose I'd get a better summary if I paid a full time summary writer with a background in technical database design structures.... because that would be cheap.

I dunno. Maybe I'm lazy and cheap, but so far I've found the usefulness is pretty high and the drawbacks are generally from people who are trying to use it for things that it's just not good at. "Gee, this bicycle is garbage because it doesn't fit 5 people and go 90mph".

What DOES concern me is that this will be like Netflix or Blockbuster, where it's great and cheap... but then once the competition is winnowed down, it's going to be "how much can I charge for this" and "Here's the idiot for free, but if you want an AI that's actually useful, and that you're now dependent upon, it's $50/month."

@coldfish @camwilson Except that the “AI summary” cannot be relied on to be accurate; it can go as far as stating the opposite of the original text.
Wouldn’t you have to spend more time verifying that if accuracy was of any concern?

@coldfish @camwilson So, what you are saying is that you are trusting an AI without double-checking it?

@cerberus1746 @camwilson
To some degree, sure.

If I have 15 documents that are 10 pages each, where they talk about how they plan to integrate my accounting system with my operations system, I'm going to have them summarized and broken into bullet points. I'll then discard the ones that are clearly not going to work, and then read through 2 or 3 that make sense.

And no, I'm not going to read the discards. So, I suppose that's "trusting AI without double checking it".

There's always going to be that level of trust that you give. I trust a car to stop when I hit the brakes. I trust my calculator to give me the right answers to complicated math I'm not going to double check. I trust my email to send mail without calling the person on the other end to see if they got it.

AI, like any other tool, is going to require some level of trust.

@camwilson
> The trial found that AI performed worse in every metric compared with humans

Is anyone doing an equivalent of web3isgoinggreat.com for the "AI" hype bubble?

@Fruan