OSI invites individuals and organizations to endorse the Open Source AI Definition (OSAID). Endorsers will have their name and affiliation listed in the press release for Release Candidate 1 (RC1), which is expected to be finalized by the end of September. Those endorsing version 0.0.9 will be contacted again to confirm their support if there are any changes leading up to RC1.
@mjbommarencourages reviewing the U.S. Copyright Office’s guidance on text and data mining (TDM) exceptions, which provides clear explanations and limitations, especially focusing on non-commercial, scholarly, and teaching uses. He emphasizes that the TDM guidance operates within narrow parameters that are often misunderstood or overlooked.
@quaidproposes adding nuance to the Open Source AI (OSAI) Definition by introducing two designations: OSAI D+ (with open data) and OSAI D- (without open data, due to legitimate reasons beyond the creator’s control). He suggests using a dataset certificate of origin (dataset DCO) for self-verification to ensure compliance.
@kjetilkagrees that verification is key but questions whether data information alone is sufficient for verification. He highlights that verifying rights to the data may not always be possible.
@stefanoappreciates the quadrant system’s clarity and confirms @quaid’s proposal for OSAI D- to be reserved for those with legitimate reasons for not sharing data.
@thesteve0expresses skepticism about broadening the “Open Source” label. He argues that without access to both data and code, AI models cannot truly be Open Source and suggests labeling such models as “open weights” instead.
@shujisadonotes the importance of data access in AI, pointing out that OSAID requires detailed information about how data is sourced, including provenance and selection criteria. He also discusses potential legal and ethical reasons for not sharing datasets.
@Shamarraises concerns about “openwashing” in AI, where developers might distribute a model with a different dataset, undermining trust. He argues that distinguishing between OSAI D+ and D- risks legal complications for derivative works, suggesting that models without open data should not be considered truly open.
@zacksupports the idea of a tiered system (D+ and D-) as an improvement over the current situation, as it incentivizes progress from D- to D+. He is skeptical about verifiability but sees potential in the branding aspect of the proposal.
@stefanoasks @arandal about suggested edits, which include renaming data as “source data,” allowing open-source AI developers to require downstream modifications with open data, and permitting downstream developers to use open data to fine-tune models trained on non-public data. He further asks if arandal compares training data to model weights as source code is to binary code.
@shujisadoagrees with @stefano and points out that while many interpret OSD-compliant licenses to include CC4 and CC0, OSI has not officially evaluated Creative Commons licenses for compliance. He highlights concerns about CC0’s patent defense, which could be crucial for datasets.
@mjbommarechoes the concerns about patent defense, noting it as a critical issue in both software and data licensing.
@Shamarsupports the first two suggestions but argues that models trained on non-public data cannot meet an “Open Source AI” definition, as they limit the freedom to study and modify, which are core principles of Open Source.
@nickshares an article by Nathan Lambert, reviewed by key figures in the Open Source AI space, discussing the challenges of training data and the current Open Source AI definition. @Percy Liang (on X) view is highlighted, where he suggests that releasing an entire dataset is neither sufficient nor necessary for Open Source AI. He emphasizes the need for detailed code of the data processing pipeline for transparency, beyond just releasing the dataset.
@shujisadodiscusses the legal nuances of using U.S. government documents in AI training, emphasizing that while they may be used in the U.S., legal complications arise in other jurisdictions.
@Shamarstresses that Open Source AI should provide all the necessary data and processing information to recreate a system, otherwise, calling it Open Source is “open washing.”
@Shamarproposes a clearer distinction between “source data” and “processing information” in the Open Source AI definition to ensure transparency and reproducibility. He suggests source data should be publicly available under the same terms that allowed its original use, while the process used to train the system should be shared under an Open Source license. His formulation aims to prevent loopholes that could lead to open-washing and emphasizes the importance of granting all four freedoms (study, modify, distribute, and use) to qualify as Open Source AI.
@nickdisagrees, arguing that @Shamar proposal misunderstands the difference between the rights to use data for training and the rights to distribute it. He also challenges the claim that exact replication of AI systems can be guaranteed, even with access to the same data.