The Document class previously allowed multiple media entries while also having a text field, leading to ambiguity in content handling. This change enforces a clear separation between text and media documents to prevent content type confusion and simplify document processing.
A Document now must contain either text content or a single media entry, but never both. This aligns with the class's primary use in ETL pipelines where clear content type boundaries are essential for proper embedding generation and vector database storage.
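The either/or rule above can be sketched as a constructor check. This is a minimal illustration under stated assumptions, not the merged implementation: `DocumentSketch` and `MediaStub` are hypothetical stand-in names, and the real class may validate differently.

```java
// Hypothetical stand-in for the Document class described above.
// Names and the exception type are assumptions made for illustration.
final class DocumentSketch {

    // Placeholder for a media payload; not the real Media type.
    static final class MediaStub {
        final String mimeType;
        final byte[] data;

        MediaStub(String mimeType, byte[] data) {
            this.mimeType = mimeType;
            this.data = data;
        }
    }

    private final String text;     // null when this is a media document
    private final MediaStub media; // null when this is a text document

    DocumentSketch(String text, MediaStub media) {
        // Reject both "text and media" and "neither text nor media".
        if ((text == null) == (media == null)) {
            throw new IllegalArgumentException(
                    "Exactly one of text or media must be specified");
        }
        this.text = text;
        this.media = media;
    }

    boolean isText() {
        return text != null;
    }

    String getText() {
        return text;
    }

    MediaStub getMedia() {
        return media;
    }
}
```

With a check like this, downstream ETL steps can branch on `isText()` once instead of guessing which field is authoritative.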
Additional architectural changes:
- Document now implements a cleaner API by removing deprecated methods
- Removed MediaContent interface implementation from Document class
- Document.getMedia() now returns a single Media object instead of Collection
- Removed EMPTY_TEXT constant in favor of proper null handling
- Constructor signatures simplified and streamlined
- Builder pattern improved to enforce single content type constraint
The breaking changes include:
- Media is now a single entry instead of a collection
- Content field renamed to text for clarity
- Removed support for mixed content types
- Simplified builder API to prevent ambiguous construction
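One way a builder could enforce the single content type constraint is to validate at `build()` time. The sketch below is an assumption-laden illustration: `SingleContentDocument`, its builder methods, and the use of a plain `String` in place of a Media object are all hypothetical and are not taken from the merged code.

```java
// Hypothetical sketch of a builder that refuses ambiguous construction.
// All names here are illustrative assumptions.
final class SingleContentDocument {

    private final String text;     // set for text documents, otherwise null
    private final String mediaUri; // stand-in for a Media object, otherwise null

    private SingleContentDocument(String text, String mediaUri) {
        this.text = text;
        this.mediaUri = mediaUri;
    }

    static Builder builder() {
        return new Builder();
    }

    String getText() {
        return text;
    }

    String getMediaUri() {
        return mediaUri;
    }

    static final class Builder {
        private String text;
        private String mediaUri;

        Builder text(String text) {
            this.text = text;
            return this;
        }

        Builder media(String mediaUri) {
            this.mediaUri = mediaUri;
            return this;
        }

        SingleContentDocument build() {
            // Mixed content is no longer supported, so fail fast.
            if (text != null && mediaUri != null) {
                throw new IllegalStateException("Specify text or media, not both");
            }
            if (text == null && mediaUri == null) {
                throw new IllegalStateException("Specify either text or media");
            }
            return new SingleContentDocument(text, mediaUri);
        }
    }
}
```

Validating in `build()` rather than in each setter keeps the setters order-independent; an alternative design would reject the second setter call immediately.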
Prefer using text-related methods over deprecated content methods to better reflect the actual content type being handled and improve API clarity.
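A rename like this is commonly handled by keeping the old accessor as a deprecated alias for a transition period. The following sketch assumes accessor names `getText()` and `getContent()` on a hypothetical `TextDocument` class; whether the merged change keeps such an alias, and under what name, is not stated here.

```java
// Hypothetical illustration of migrating callers from a content accessor
// to a text accessor. Class and method names are assumptions.
final class TextDocument {

    private final String text;

    TextDocument(String text) {
        this.text = text;
    }

    // Preferred accessor: the name says what the field holds.
    String getText() {
        return text;
    }

    /**
     * Legacy alias kept only so existing callers still compile.
     *
     * @deprecated use {@link #getText()} instead
     */
    @Deprecated
    String getContent() {
        return getText();
    }
}
```

Callers would then move from `doc.getContent()` to `doc.getText()`, and the compiler's deprecation warnings point out the remaining call sites.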
Comment From: markpollack
merged in dfbc394f8311b4c919079a734280aa56f6e1b7d0