`Idefics` Documentation

Introduction

Idefics is a multimodal inference wrapper around pre-trained models from the Hugging Face Hub. It generates text from prompts that combine text and images. This documentation covers the class definition, usage examples, configuration methods, and constructor parameters.

Overview

Idefics uses a pre-trained model to generate text responses to prompts that contain text, images, or both. This makes it suitable for multimodal tasks such as answering questions about an image.

Class Definition

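import torch

class Idefics:
    def __init__(
        self,
        checkpoint="HuggingFaceM4/idefics-9b-instruct",  # model checkpoint on the Hugging Face Hub
        device=None,  # None selects CUDA if available, otherwise CPU
        torch_dtype=torch.bfloat16,  # data type used for inference
        max_length=100,  # maximum length of the generated text
    ):
        ...  # constructor body omitted in this documentation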

Usage

To use Idefics, follow these steps:

Initialize the Idefics instance:

from swarms.models import Idefics

model = Idefics()

Generate text based on prompts:

prompts = [
    "User: What is in this image? https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG"
]
response = model(prompts)
print(response)

Example 1 - Image Questioning

from swarms.models import Idefics

model = Idefics()
prompts = [
    "User: What is in this image? https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG"
]
response = model(prompts)
print(response)
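
The prompt above places the question and the image URL in a single string. The Transformers IDEFICS examples instead pass each prompt as a list of interleaved text segments and image URLs. Whether the Idefics wrapper separates the two itself is not stated in this documentation, so the following is only a minimal sketch of how such a split could be done by hand. The helper name split_prompt and the URL pattern are assumptions made for illustration and are not part of swarms.

```python
import re
from typing import List

# Assumption: an image reference is any http(s) URL running up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def split_prompt(prompt: str) -> List[str]:
    """Split a prompt into text and URL segments, preserving their order.

    Hypothetical helper for illustration; not part of the swarms API.
    """
    segments: List[str] = []
    position = 0
    for match in URL_PATTERN.finditer(prompt):
        # Keep any non-empty text that precedes this URL.
        text = prompt[position:match.start()].strip()
        if text:
            segments.append(text)
        segments.append(match.group())
        position = match.end()
    # Keep any non-empty text that follows the last URL.
    tail = prompt[position:].strip()
    if tail:
        segments.append(tail)
    return segments


segments = split_prompt(
    "User: What is in this image? "
    "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG"
)
print(segments)
```

If the image does not appear to be taken into account in a response, passing the segments separately in this way is one thing to try.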

Example 2 - Bidirectional Conversation

from swarms.models import Idefics

model = Idefics()
user_input = "User: What is in this image? https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG"
response = model.chat(user_input)
print(response)

user_input = "User: Who is that? https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052"
response = model.chat(user_input)
print(response)

Example 3 - Configuration Changes

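from swarms.models import Idefics

model = Idefics()
model.set_checkpoint("new_checkpoint")  # placeholder; substitute a real checkpoint name
model.set_device("cpu")  # run inference on the CPU
model.set_max_length(200)  # raise the limit from the default of 100
model.clear_chat_history()  # discard the stored conversation context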

How Idefics Works

Idefics wraps a pre-trained model from the Hugging Face Hub. It works as follows:

Initialization: Creating an Idefics instance loads the model from the specified checkpoint, selects the device for inference, and stores the data type and maximum text length.
Prompt-Based Inference: The infer method generates text from prompts, in batched or non-batched mode. A pre-trained processor converts the text and images in each prompt into model inputs. The examples in this documentation call the instance directly, as in model(prompts).
Bidirectional Conversation: The chat method supports multi-turn conversations. Each call takes one user input and returns the model's response. The chat history is kept so that later turns have the earlier ones as context.
Configuration Changes: At runtime you can change the model checkpoint, the device, and the maximum text length, or clear the chat history.
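
The history handling described for the chat method can be sketched without loading a model. This is a minimal illustration and not the library's implementation: the ChatSession class and the fake_generate function are hypothetical, and only the method names chat and clear_chat_history come from this documentation.

```python
from typing import Callable, List


class ChatSession:
    """Hypothetical stand-in showing how a chat history can carry context."""

    def __init__(self, generate: Callable[[List[str]], str]) -> None:
        # `generate` stands in for model inference over the whole history.
        self.generate = generate
        self.chat_history: List[str] = []

    def chat(self, user_input: str) -> str:
        # Record the user turn, generate from the full history, record the reply.
        self.chat_history.append(user_input)
        response = self.generate(self.chat_history)
        self.chat_history.append(response)
        return response

    def clear_chat_history(self) -> None:
        # Drop all stored turns so the next call starts without context.
        self.chat_history = []


def fake_generate(history: List[str]) -> str:
    # Placeholder for a real model: report how many messages were visible.
    return f"Assistant: (saw {len(history)} messages)"


session = ChatSession(fake_generate)
first = session.chat("User: What is in this image?")
second = session.chat("User: Who is that?")
print(first)
print(second)
```

Because every call passes the accumulated history to the generator, the prompt grows with each turn. Clearing the history between unrelated conversations keeps the input short.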

Parameters

checkpoint: The name of the pre-trained model checkpoint (default is "HuggingFaceM4/idefics-9b-instruct").
device: The device to use for inference. By default, it uses CUDA if available; otherwise, it uses CPU.
torch_dtype: The data type to use for inference. By default, it uses torch.bfloat16.
max_length: The maximum length of the generated text (default is 100).
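
The default described for device (CUDA if available, otherwise CPU) can be written as a small function. This is a sketch under stated assumptions: resolve_device is a hypothetical name, and the cuda_available flag is passed in explicitly so the example runs without PyTorch installed. In real code that flag would typically come from torch.cuda.is_available().

```python
from typing import Optional


def resolve_device(device: Optional[str], cuda_available: bool) -> str:
    """Return the device to use for inference (hypothetical helper).

    An explicit `device` always wins. Otherwise fall back to CUDA when it
    is available and to the CPU when it is not.
    """
    if device is not None:
        return device
    return "cuda" if cuda_available else "cpu"


# No device given: the result depends on CUDA availability.
print(resolve_device(None, cuda_available=True))
print(resolve_device(None, cuda_available=False))
# An explicit device is returned unchanged.
print(resolve_device("cpu", cuda_available=True))
```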

Additional Information

Idefics provides a convenient way to engage in bidirectional conversations with pre-trained models.
You can easily change the model checkpoint, device, and other settings to adapt to your specific use case.

If you have questions or run into issues with the underlying models, refer to the Hugging Face Transformers documentation.
