this post was submitted on 05 May 2024
59 points (96.8% liked)

[–] [email protected] 10 points 9 months ago (2 children)

Hey @[email protected], the bot seems to be having trouble parsing articles from the ABC correctly. It has happened on a fair few posts (see the other examples below). As far as I can tell, it's only occurring on articles from the ABC, and I'm not entirely sure what's causing it.

Other examples:

https://lemmings.world/comment/8105800

https://lemmings.world/comment/8196693

[–] [email protected] 3 points 9 months ago (1 children)

Thanks for the report! It's fixed now.

[–] [email protected] 1 points 9 months ago
[–] [email protected] 2 points 9 months ago* (last edited 9 months ago) (1 children)

It looks like ABC must have changed the internal layout of their pages for whatever reason. It seems like the bot is just selecting the first block quote as the entire article.

On The Register, for example, it selects the div with the id #body. For ABC, it seems to look for the class Article_Body, which I can't find on that article. I might have a closer look later if I've got some time and try to get a PR in if it doesn't get fixed.

[–] [email protected] 3 points 9 months ago (1 children)

That's the case: they removed one level of nesting from the HTML. It doesn't look for an exact Article_Body class, though, but for any class containing Article_body. They use randomized class names with a constant prefix, which is why I have to do it that way. I've updated it to this horrible-looking selector: div[class*="Article_body"] > div > p, div[class*="Article_body"] > div > ul:not([class*="ShareUtility"]) > li.
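For anyone curious what that kind of match does, here is a rough sketch in Python using only the standard library. It is not the bot's actual code: the class name ArticleBodyExtractor, the helper extract_blocks, and the sample HTML (including the made-up class suffixes) are all assumptions for illustration. It also simplifies the selector by matching any descendant rather than the strict parent > child steps.

```python
from html.parser import HTMLParser


class ArticleBodyExtractor(HTMLParser):
    """Hypothetical helper: collects the text of <p> and <li> elements inside
    a <div> whose class attribute contains `marker`, similar in spirit to the
    CSS test [class*="Article_body"]. Lists whose class contains `skip_marker`
    are ignored. Descendant matching is used instead of strict child matching."""

    def __init__(self, marker="Article_body", skip_marker="ShareUtility"):
        super().__init__()
        self.marker = marker
        self.skip_marker = skip_marker
        self.stack = []      # open elements as (tag, class attribute) pairs
        self.current = None  # text pieces of the <p>/<li> being read, if any
        self.blocks = []     # extracted paragraphs and list items

    def _inside(self, tag, marker):
        # True if some open <tag> has a class attribute containing `marker`.
        return any(t == tag and marker in cls for t, cls in self.stack)

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        self.stack.append((tag, cls))
        if (tag in ("p", "li")
                and self._inside("div", self.marker)
                and not self._inside("ul", self.skip_marker)):
            self.current = []

    def handle_endtag(self, tag):
        if tag in ("p", "li") and self.current is not None:
            text = " ".join("".join(self.current).split())
            if text:
                self.blocks.append(text)
            self.current = None
        # Pop back to the matching open tag (also drops unclosed tags like <br>).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)


def extract_blocks(html):
    parser = ArticleBodyExtractor()
    parser.feed(html)
    return parser.blocks


# Made-up markup: the suffixes after "Article_body" and "ShareUtility" are invented.
SAMPLE = """
<blockquote><p>Pull quote that is not the article.</p></blockquote>
<div class="Article_body__x1y2z">
  <div>
    <p>First paragraph.</p>
    <ul class="ShareUtility_list__abc"><li>Share this</li></ul>
    <ul><li>Key point one</li></ul>
    <p>Second <a href="#">paragraph</a>.</p>
  </div>
</div>
"""

print(extract_blocks(SAMPLE))
```

Because the check is a substring test on the class attribute, the randomized suffix doesn't matter as long as the Article_body part stays put. The block quote outside the article body and the share list are both skipped.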

[–] [email protected] 2 points 9 months ago

Thanks! I thought it might've been a wildcard thing but wasn't sure. They really don't want their articles summarised, do they? (Or they're probably trying to discourage AI scrapers.)