SourceHut continues to face disruptions due to aggressive LLM crawlers. We are continuously working to deploy mitigations. We have deployed a number of mitigations which are keeping the problem contained for now. However, some of our mitigations may impact end-users.

  • thatsnothowyoudoit@lemmy.ca
    13 upvotes · 3 hours ago (edited)

    We return NGINX’s 444 (a non-standard code that closes the connection without sending a response) to every LLM crawler we see.
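
    As a rough sketch of what such a rule could look like (not the poster's actual config): the fragment below flags requests by user agent and answers them with 444. The user-agent patterns, server name, and document root are illustrative assumptions, so substitute whatever you actually see in your logs.

    ```nginx
    # Goes in the http {} context.
    # The patterns below are examples only, not a vetted blocklist.
    map $http_user_agent $drop_crawler {
        default        0;
        ~*GPTBot       1;
        ~*ClaudeBot    1;
        ~*CCBot        1;
    }

    server {
        listen 80;
        server_name example.org;      # placeholder

        # 444 is NGINX-specific: the connection is closed and
        # no response is sent to the client.
        if ($drop_crawler) {
            return 444;
        }

        location / {
            root /var/www/html;       # placeholder
        }
    }
    ```

    Matching on user agent only catches crawlers that identify themselves honestly; ones that spoof a browser string would need a different signal.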

    Caddy has a similar “close connection” option called “abort” as part of the static response.

    HAProxy has the “silent-drop” option which also closes the TCP connection silently.

    I’ve found crawling attempts end more quickly using this option - especially attacks - but my sample size is relatively small.

    Edit: we do this because too often we’ve seen them ignore robots.txt. They believe all data is theirs. I do not.

  • Treczoks@lemmy.world
    13 upvotes · 4 hours ago

    I wonder how much of the load problems I observe with lemmy.world are due to AI crawlers.

  • Roguelazer@lemmy.world
    8 upvotes, 1 downvote · 4 hours ago

    The companies that run these residential proxy networks are sketchy as shit and in a better world would be criminally prosecuted. They’re tricking random low-information users into installing VPNs and other software with backdoors that turn them into a veritable botnet.