
Unauthorized Data Usage by AI Giants: YouTube and Social Media Content Exploited for Training
The video discusses how major AI companies, OpenAI in particular, used data without authorization to train systems such as Whisper. Anomalies in automatic subtitles (e.g., "Subscribe" appearing where nothing was spoken) indicate that these models were trained on YouTube and social media content without consent. The training data includes transcriptions, videos, and user-generated content; manually annotated data is estimated to cost $100 to $300 per hour. OpenAI and other players (Meta, Google) rely on partnerships or scraping to get around restrictions, despite the legal risks (e.g., Anthropic's $1.5 billion copyright settlement). Freemium products (like the free tier of ChatGPT) are also used to collect user data. Challenges include dataset quality, freshness, and linguistic diversity, particularly for languages like Arabic (30% error rate). Google, with its control over hardware (TPUs) and its financial resources, is presented as a key player in this "data war."