Dear Creative Commons (@creativecommons.org @creativecommons@mastodon.social @creativecommons@x.com),
Can we have CC-NT licenses for no-training (ML/LLM, GenAI in general), just like we have CC-NC for non-commercial?
My previous post¹ reminded me that I’ve been creating, writing, inventing, and then sharing things with #CreativeCommons (CC) #licenses for a long time. (I have to see if I can dig up my first use of CC licenses.)
I’ve used and recommended a variety of CC licenses for decades, e.g.
* CC0 — for standards work, e.g. I drove and wrote up https://wiki.mozilla.org/Standards/licensing (with help from lawyers)
* CC-BY — aforementioned blog post (and other snippets of #openSource)
* CC-BY-NC — photos on Flickr (dozens of which have been used in publications²)
* CC-SA — for CASSIS³, which I still consider experimental enough that I chose "share-alike" to deliberately slow its spread, and hopefully reduce mutations (while allowing ports of its functions to other languages)
So I have some idea of what I’m talking about.
There have been LOTS of discussions of the challenges, downsides, and disagreements around the sweeping use of copyrighted content to train generative artificial intelligence AKA #genAI software and services, sometimes also called #machineLearning. The most common examples are Large Language Models AKA #LLM, but there are also models for generating images and video. Smart, intelligent, and well-intentioned people disagree on who has rights to do what, or even who should do what in this regard.
There have been many proposals for new standards, or updates to existing standards like robots.txt, etc., but I have not really seen them make noticeable progress. There are also lots of published techniques that attempt to block the spiders and bots being used to crawl and collect content for GenAI, an arms race that ends up damaging well-established popular uses such as web search engines (or making it harder to build a new one).
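For illustration, the most widely deployed of these opt-out techniques today is a robots.txt rule naming the crawl operators’ own published GenAI user agent tokens (e.g. OpenAI’s GPTBot and Google’s Google-Extended). A minimal sketch, noting that compliance is entirely voluntary and the token list goes stale as new crawlers appear:

```
# robots.txt — ask known GenAI training crawlers to stay out
# (honored only by crawlers that choose to comply)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

This is exactly the per-crawler whack-a-mole that a single, widely recognized license signal like a “CC-NT” could replace.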
The brilliant innovation of Creative Commons was to look at the use-cases and intentions of creators publishing on the web in the 2000s and capture them in a small handful of clear licenses with human-readable summaries.
Creatives are clamoring for a simple way to opt their publicly published content out of being used to train GenAI. New Creative Commons licenses would solve this.
This seems like an obvious thing to me. If you can write a license that forbids “commercial use”, then you should be able to write a license that forbids use in “training models”, which respectful, well-written crawlers should (hopefully) honor, inasmuch as they respect existing CC licenses.
I saw that Creative Commons published a position paper⁴ for an IETF workshop on this topic, and in my opinion it unfortunately has an overly cautious and pessimistic (outright conservative, one could say) outlook, one that frankly I believe the founders of Creative Commons (who dared to boldly create something new) would probably be disappointed in.
First, there is no Creative Commons license on the Creative Commons position paper. Why?
Second, there are no names of authors on the Creative Commons position paper. Why?
Lots of people similarly (to the position paper) said the original Creative Commons licenses were a bad idea, or would not be used, or would be ignored, or would otherwise not work as intended. They were wrong.
If I were a lawyer I would fork those existing licenses and produce such “CC-NT” (for “no-training”) variants (though likely prefix them with something else since "CC" means Creative Commons) just to show it could be done, a proof of concept as it were that creators could use.
Or perhaps a few of us could collect funds to pay an intellectual property lawyer to do so, and of course donate all the work produced to the commons, so that Creative Commons (or someone else) could take it, re-use it, build upon it.
Someone needs to take such a bold step, just as Creative Commons itself took a bold step when they dared to create portable re-usable content licenses that any creator could use (a huge innovation at the time, for content, inspired no doubt by portable re-usable open source licenses⁵).
References:
¹ https://tantek.com/2024/263/t1/20-years-undohtml-css-resets
² https://flickr.com/search/?user_id=tantek&tags=press&view_all=1
³ https://tantek.com/github/cassis
⁴ Creative Commons Position Paper on Preference Signals, https://www.ietf.org/slides/slides-aicontrolws-creative-commons-position-paper-on-preference-signals-00.pdf
⁵ https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses
-
No large language models (LLM) were used in the production of this post.
Inspired by a subtle but clear sign-of-the-times one-line disclaimer at the end of RFC 9518’s Acknowledgments (https://www.rfc-editor.org/rfc/rfc9518.html#appendix-A-4)
“No large language models were used in the production of this document.”
I have added a similar disclaimer to the footer of my homepage:
“No large language models were used in the production of this site.”
2023 was certainly the year that LLMs took off and stole the hype cycle from #metaverse, which itself had stolen it from #blockchain before that.
Yet unlike those previous two, #LLMs are already having real impacts on the way people create (from emails to art), communicate (LLM chat apps), and work (2023 Writer’s Strike), fueling growing concerns about the authenticity of content, especially content from human authors.
I expect we will see more such disclaimers in the future.
For now, if you blog on your own site with words written by you not #ChatGPT or a similar tool, I encourage you to add a similar disclaimer, and then add your site as an example to the #IndieWeb wiki:
* https://indieweb.org/LLM#IndieWeb_Examples
#largeLanguageModel #LLM #generativeAI #AI
There is a related problem: when you discover what seems to be an independent site written by a human, how do you know that human actually exists?
For now I’ll mention that XFN rel=met links, published (e.g. metrolls / met-rolls), aggregated, indexed, and queried, can solve that problem. This will be similar to how XFN rel=me links solved #distributed verification on the web (see https://tantek.com/2023/234/t1/threads-supports-indieweb-rel-me and posts it links to).
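The rel=me verification referenced above boils down to a bidirectional-link check: page A links to page B with rel="me", and B links back to A the same way. A minimal sketch of just that check in Python (the function name and data shape are my own illustration, not any existing library; a real verifier would fetch the live pages and extract their rel="me" links first):

```python
def rel_me_verified(page_a: str, page_b: str,
                    rel_me_links: dict[str, set[str]]) -> bool:
    """Return True when two pages claim each other via rel="me" links.

    rel_me_links maps a page URL to the set of URLs that page links to
    with rel="me", as a crawler would extract from its HTML.
    """
    return (page_b in rel_me_links.get(page_a, set())
            and page_a in rel_me_links.get(page_b, set()))


# Example: a personal site and a profile that link to each other
links = {
    "https://example.com/": {"https://mastodon.social/@example"},
    "https://mastodon.social/@example": {"https://example.com/"},
}
print(rel_me_verified("https://example.com/",
                      "https://mastodon.social/@example", links))
```

A rel=met verifier would apply the same aggregate-and-query approach, but over human-to-human "met" claims rather than same-person identity claims.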
This is day 48 of #100DaysOfIndieWeb. #100Days
← Day 47: https://tantek.com/2023/365/t1/capture-first-edit-publish-later
→ 🔮
Post glossary:
blockchain
https://indieweb.org/blockchain
large language model / LLM
https://indieweb.org/large_language_model
metaverse
https://indieweb.org/metaverse
rel=me
https://indieweb.org/rel-me
rel=met
http://gmpg.org/xfn/11#met
XFN
https://gmpg.org/xfn/