To feed raw data like images or text directly into a Generative AI model without heavy preprocessing, use architectures like Vision Transformers (ViTs) or Transformer-based language models that process raw inputs with minimal feature engineering.
Here is the code snippet you can refer to:
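A minimal sketch using Hugging Face's `transformers` library. To keep it self-contained (no model download), it builds a randomly initialized ViT; in practice you would load pretrained weights, e.g. `ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")`.

```python
import numpy as np
import torch
from PIL import Image
from transformers import ViTConfig, ViTImageProcessor, ViTForImageClassification

# Raw input: any RGB image (random pixels here stand in for a real photo).
image = Image.fromarray(np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))

# The processor handles the only "preprocessing" needed:
# resizing to 224x224 and normalizing pixel values.
processor = ViTImageProcessor()  # defaults match the standard ViT-base recipe
inputs = processor(images=image, return_tensors="pt")

# ViT splits the image into patches and runs a Transformer over them;
# no hand-crafted features (edges, SIFT, etc.) are ever computed.
# Random weights are used here purely for illustration.
model = ViTForImageClassification(ViTConfig(num_labels=1000))
model.eval()

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 1000)

predicted_class = logits.argmax(-1).item()
print(predicted_class)
```

With pretrained weights, `model.config.id2label[predicted_class]` would map the index to a human-readable ImageNet label.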

The code above takes the following approach:
- Uses a Vision Transformer (ViT) model that processes raw images with minimal preprocessing.
- Applies an image processor that automatically resizes and normalizes images for model compatibility.
- Feeds the raw image data directly into the model without manual feature engineering.
- Performs classification inference without requiring a complex preprocessing pipeline.
Hence, modern Transformer-based models — ViTs for images and Transformer language models for text — process raw inputs efficiently with only minimal feature engineering.