Introduction
The introduction of the original transformer paved the way for today's Large Language Models, and soon after, the Vision Transformer (ViT) followed. Just as transformers excel at understanding and generating text given a prompt, vision transformers were developed to understand images and extract information from them. Together, these advances led to Vision Language Models, which combine both abilities. Microsoft has taken this a step further and introduced a single model capable of performing many vision tasks. In this guide, we will take a look at that model, Florence-2, released by Microsoft and designed to solve many different vision tasks.
Learning Objectives
- Get introduced to Florence-2, a Vision Language Model.
- Understand the data on which Florence-2 is trained.
- Get to know the different models in the Florence-2 family.
- Learn how to download Florence-2.
- Write code to perform different computer vision tasks with Florence-2.
This article was published as a part of the Data Science Blogathon.
What is Florence-2?
Florence-2 is a Vision Language Model (VLM) developed by the Microsoft team. Florence-2 comes in two sizes: a 0.23B-parameter version and a 0.77B-parameter version. These small sizes make it easy for anyone to run the models even on a CPU. Florence-2 was created with the idea that one model can solve everything. It is trained to solve many different tasks, including object detection, object segmentation, image captioning (including detailed captions), phrase grounding, OCR (Optical Character Recognition), and combinations of these.
The Florence-2 Vision Language Model is trained on the FLD-5B dataset, created by the Microsoft team. FLD-5B contains about 5.4 billion text annotations on around 126 million images, including 500 million text annotations, 1.3 billion text-region annotations, and 3.6 billion text-phrase-region annotations. Florence-2 accepts text instructions and image inputs, and it generates text results for tasks like OCR, object detection, or image captioning.
The architecture consists of a vision encoder followed by a transformer encoder-decoder block, and training uses the standard cross-entropy loss. The Florence-2 model produces three types of region outputs: box representations for object detection, quad-box representations for OCR text detection, and polygon representations for segmentation tasks.
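For quick reference, these are the task prompts we will use throughout this guide, together with the kind of output each produces. The dictionary below is just our own summary; the prompt strings themselves are the ones used in the code later on.
# Task prompts used in this guide and the kind of output each one produces.
FLORENCE2_TASKS = {
    "<CAPTION>": "short image caption",
    "<DETAILED_CAPTION>": "more detailed caption",
    "<MORE_DETAILED_CAPTION>": "most detailed caption",
    "<OD>": "object detection: bounding boxes and labels",
    "<CAPTION_TO_PHRASE_GROUNDING>": "bounding boxes for noun phrases in a given caption",
    "<REFERRING_EXPRESSION_SEGMENTATION>": "polygon masks for the object referred to in the text",
}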
Image Captioning with Florence-2
Image captioning is a Vision Language task where, given an image, the deep learning model outputs a caption describing it. This caption can be short or detailed depending on how the model was trained. Models that perform this task are trained on large image-captioning datasets, where they learn to output text given an image. The more data they are trained on, the better they get at describing images.
Downloading and Installing
We will start by downloading and installing the libraries we need to run the Florence-2 vision model.
!pip install -q -U transformers accelerate flash_attn einops timm
- transformers: HuggingFace’s Transformers library provides various deep learning models for different tasks that you can download.
- accelerate: HuggingFace’s Accelerate library handles device placement and helps run models efficiently when serving them on a GPU.
- flash_attn: The Flash Attention library implements a faster, memory-efficient attention algorithm than the standard implementation, and it is used by the Florence-2 model.
- einops: Einstein Operations simplifies expressing tensor operations such as reshaping and rearranging, and is used in the Florence-2 implementation.
Downloading Florence-2 Model
Now, we need to download the Florence-2 model. For this, we will work with the below code.
from transformers import AutoProcessor, AutoModelForCausalLM
model_id = 'microsoft/Florence-2-large-ft'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval().cuda()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, device_map="cuda")
- We begin by importing the AutoModelForCausalLM and AutoProcessor.
- Then we store the model ID in the model_id variable. Here we will work with the Florence-2 Large fine-tuned model.
- Then we create an instance of AutoModelForCausalLM by calling the .from_pretrained() function, giving it the model ID and setting trust_remote_code=True; this downloads the model from the HuggingFace repository.
- We then set the model to evaluation mode by calling .eval() and move it to the GPU by calling .cuda().
- Then we create an instance of AutoProcessor by calling .from_pretrained(), giving it the model ID and setting device_map to "cuda".
AutoProcessor is very similar to AutoTokenizer, but while AutoTokenizer only handles text and text tokenization, AutoProcessor handles both text tokenization and image preprocessing. Because Florence-2 takes image data as input, we work with AutoProcessor.
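To see this concretely, here is a minimal sketch that inspects what the processor returns. The dummy image and the printed shapes are only for illustration, and it assumes the processor loaded in the previous step.
# The processor returns both text token ids and preprocessed image pixels.
from PIL import Image

dummy_image = Image.new("RGB", (256, 256), color="white")  # placeholder image
sample = processor(text="<CAPTION>", images=dummy_image, return_tensors="pt")
print(sample.keys())                 # expect 'input_ids' and 'pixel_values'
print(sample["input_ids"].shape)     # token ids of the text prompt
print(sample["pixel_values"].shape)  # preprocessed image tensor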
Now, let us take an image:
from PIL import Image
image = Image.open("/content/beach.jpg")
Here, we have taken a beach photo.
Generating Caption
Now we will give this image to the Florence-2 Vision Language Model and ask it to generate a caption.
PROMPT = "<CAPTION>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
                                           skip_special_tokens=False)[0]
result = processor.post_process_generation(text_generations,
                                           task=PROMPT,
                                           image_size=(image.width, image.height))
print(result[PROMPT])
- We begin by creating the prompt.
- Then, we give both the prompt and the image to the processor and ask it to return PyTorch tensors. We move them to the GPU, because the model resides on the GPU, and store them in the variable inputs.
- The inputs variable contains the input_ids, i.e. the token IDs, and the pixel values for the image.
- Then we call the model’s generate function, giving it the input IDs and the image pixel values. We set the maximum number of generated tokens to 512, keep sampling turned off, and store the generated tokens in generated_ids.
- Then we call the processor’s .batch_decode function, giving it the generated_ids and setting the skip_special_tokens flag to False. The result is a list, so we take its first element.
- Finally, we post-process the generated text by calling .post_process_generation and giving it the generated text, the task type, and the image_size as a tuple.
Running the code, we see that the model has generated the caption “An umbrella and lounge chair on a beach with the ocean in the background” for the image. This caption is very short.
Providing Prompts
We can take this a step further by providing other prompts like <DETAILED_CAPTION> and <MORE_DETAILED_CAPTION>.
The code for trying this can be seen below:
PROMPT = "<DETAILED_CAPTION>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
                                           skip_special_tokens=False)[0]
result = processor.post_process_generation(text_generations,
                                           task=PROMPT,
                                           image_size=(image.width, image.height))
print(result[PROMPT])
PROMPT = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
                                           skip_special_tokens=False)[0]
result = processor.post_process_generation(text_generations,
                                           task=PROMPT,
                                           image_size=(image.width, image.height))
print(result[PROMPT])
Here, we have used <DETAILED_CAPTION> and <MORE_DETAILED_CAPTION> as the task type, and we can see the results after running the code. The <DETAILED_CAPTION> prompt produced the output “In this image we can see a chair, table, umbrella, water, ships, trees, building and sky with clouds.” and the <MORE_DETAILED_CAPTION> prompt produced the output “An orange umbrella is on the beach. There is a white lounge chair next to the umbrella. There are two boats in the water.” So with these two prompts, we get more depth in the image captions than with the regular <CAPTION> prompt.
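Since this generate, decode, and post-process pattern repeats for every task in the rest of this guide, you may prefer to wrap it in a small helper. The following is a minimal sketch; run_florence is our own convenience function, not part of the transformers API, and it assumes the model and processor loaded earlier.
# Wrap the repeated generate -> decode -> post-process steps in one function.
def run_florence(task_prompt, image, text_input=""):
    prompt = task_prompt + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
        do_sample=False,
    )
    decoded = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        decoded, task=task_prompt, image_size=(image.width, image.height)
    )

# Example: the same detailed caption as above, in one call.
print(run_florence("<MORE_DETAILED_CAPTION>", image)["<MORE_DETAILED_CAPTION>"])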
Object Detection with Florence-2
Object detection is one of the well-known tasks in computer vision. It deals with finding objects in a given image. In object detection, the model identifies objects in the image and provides the X and Y coordinates of bounding boxes around them. The Florence-2 Vision Language Model is very capable of detecting objects in an image.
Let us try this with the below image:
image = Image.open("/content/van.jpg")
Here, we have an image of a bright orange van on the road with a white building in the background.
Providing Image to Florence-2 Vision Language Model
Now let us give this image to the Florence-2 Vision Language Model.
PROMPT = "<OD>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
                                           skip_special_tokens=False)[0]
results = processor.post_process_generation(text_generations,
                                            task=PROMPT,
                                            image_size=(image.width, image.height))
The process for object detection is very similar to the image captioning task we have just done. The only difference is that we change the prompt to <OD>, meaning object detection. So we give this prompt along with the image to the processor object and obtain the tokenized inputs. Then we give these tokenized inputs along with the image pixel values to the Florence-2 Vision Language Model to generate the output, and then decode this output.
The output is stored in a variable named results. The results variable has the format {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['label1', 'label2', ...]}}. So the Florence-2 Vision Model outputs the bounding box X and Y coordinates for each label, that is, for each object it detects in the image.
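Before plotting, it can help to print the detections as plain text to verify this structure; a minimal sketch using the dictionary described above:
# Quickly inspect the detections before drawing them.
detections = results[PROMPT]  # PROMPT is "<OD>" here
for bbox, label in zip(detections["bboxes"], detections["labels"]):
    x1, y1, x2, y2 = bbox
    print(f"{label}: top-left=({x1:.1f}, {y1:.1f}), bottom-right=({x2:.1f}, {y2:.1f})")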
Drawing Bounding Boxes on the Image
Now, we will draw those bounding boxes on the image with the coordinates that we have.
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots()
ax.imshow(image)

for bbox, label in zip(results[PROMPT]['bboxes'], results[PROMPT]['labels']):
    x1, y1, x2, y2 = bbox
    rect_box = patches.Rectangle((x1, y1), x2-x1, y2-y1, linewidth=1,
                                 edgecolor="r", facecolor="none")
    ax.add_patch(rect_box)
    plt.text(x1, y1, label, color="white", fontsize=8, bbox=dict(facecolor="red", alpha=0.5))

ax.axis('off')
plt.show()
- To draw rectangular bounding boxes on the image, we work with the matplotlib library.
- We begin by creating a figure and an axis, and then we display the image that we gave to the Florence-2 Vision Language Model.
- Each bounding box that the model outputs is a list containing X and Y coordinates, and the final output holds one bounding box per label, that is, each detected object has its own bounding box.
- So we iterate through the list of bounding boxes together with their labels.
- Then we unpack the X and Y coordinates of each bounding box.
- Then we draw a rectangle with the coordinates that we unpacked in the last step.
- Finally, we patch it onto the image that we are currently displaying.
- We also add a label to each bounding box to indicate which object it contains.
- Finally, we remove the axis.
Running this code, we see that the Florence-2 Vision Language Model generates many bounding boxes for the van image we gave it. The model has detected the van, windows, and wheels, and was able to give the correct coordinates for each label.
Caption to Phrase Grounding
Next, we have a task called “Caption to Phrase Grounding”, which the Florence-2 model supports. Given an image and a caption for it, the phrase grounding task is to locate, for each noun phrase in the caption, the most relevant entity or object as a region in the image.
We can take a look at this task with the below code:
PROMPT = "<CAPTION_TO_PHRASE_GROUNDING> An orange van parked in front of a white building"
task_type = "<CAPTION_TO_PHRASE_GROUNDING>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
                                           skip_special_tokens=False)[0]
results = processor.post_process_generation(text_generations,
                                            task=task_type,
                                            image_size=(image.width, image.height))
Here the prompt is “<CAPTION_TO_PHRASE_GROUNDING> An orange van parked in front of a white building”, where the task is “<CAPTION_TO_PHRASE_GROUNDING>” and the phrase is “An orange van parked in front of a white building”. The Florence-2 model tries to generate bounding boxes for the objects/entities it can extract from this phrase. Let us see the final output by plotting it.
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots()
ax.imshow(image)

for bbox, label in zip(results[task_type]['bboxes'], results[task_type]['labels']):
    x1, y1, x2, y2 = bbox
    rect_box = patches.Rectangle((x1, y1), x2-x1, y2-y1, linewidth=1,
                                 edgecolor="r", facecolor="none")
    ax.add_patch(rect_box)
    plt.text(x1, y1, label, color="white", fontsize=8, bbox=dict(facecolor="red", alpha=0.5))

ax.axis('off')
plt.show()
Here we see that the Florence-2 Vision Language Model was able to extract two entities from the caption: an orange van and a white building. Florence-2 then generated bounding boxes for each of these entities. This way, given a caption, the model can extract the relevant entities/objects from it and generate the corresponding bounding boxes.
Segmentation with Florence-2
Segmentation is a process where an image is taken and masks are generated for multiple parts of the image, with each mask corresponding to an object. Segmentation is the next stage after object detection. In object detection, we only find the location of an object and generate a bounding box. In segmentation, instead of generating a rectangular bounding box, we generate a mask in the shape of the object. This is helpful because we know not only the location of the object but also its shape. And luckily, the Florence-2 Vision Language Model supports segmentation.
Segmentation on Image
We will try segmentation on our van image.
PROMPT = "<REFERRING_EXPRESSION_SEGMENTATION>two black tires"
task_type = "<REFERRING_EXPRESSION_SEGMENTATION>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
                                           skip_special_tokens=False)[0]
results = processor.post_process_generation(text_generations,
                                            task=task_type,
                                            image_size=(image.width, image.height))
- Here, the process is similar to the Image Captioning and the Object Detection Tasks. We start by providing the Prompt.
- Here the Prompt is “<REFERRING_EXPRESSION_SEGMENTATION>two black tires” where the task is segmentation.
- The segmentation will be based on the text input provided, here it is “two black tires”.
- So the Florence-2 model will try to generate masks that closely match this text input and the provided image.
Here the results variable will have the format {'<REFERRING_EXPRESSION_SEGMENTATION>': {'polygons': [[[x1, y1, x2, y2, ..., xn, yn]], ...], 'labels': ['', '', ...]}}, where each object/mask is represented by a list of polygons, and each polygon has the form [x1, y1, x2, y2, ..., xn, yn].
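If you need the result as a binary mask (for example to combine it with NumPy or OpenCV code) rather than a drawing, you can rasterize the polygons yourself. This is a sketch under the format described above, and it assumes the results variable from the segmentation call:
# Rasterize the predicted polygons into a boolean HxW mask.
import numpy as np
from PIL import Image, ImageDraw

mask = Image.new("L", (image.width, image.height), 0)  # single-channel canvas
mask_draw = ImageDraw.Draw(mask)
for polygons in results[task_type]["polygons"]:
    for _polygon in polygons:
        points = np.array(_polygon).reshape(-1, 2)
        if len(points) < 3:
            continue  # skip degenerate polygons
        mask_draw.polygon([tuple(p) for p in points], fill=255)
mask_array = np.array(mask) > 0  # boolean mask of the segmented region
print("mask pixels:", mask_array.sum())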
Creating Masks and Overlaying on Actual Image
Now, we will create these masks and overlay them on the actual image so we can visualize it better.
import copy
import numpy as np
from IPython.display import display
from PIL import Image, ImageDraw, ImageFont

output_image = copy.deepcopy(image)
res = results[task_type]
draw = ImageDraw.Draw(output_image)
scale = 1

for polygons, label in zip(res['polygons'], res['labels']):
    fill_color = "blue"
    for _polygon in polygons:
        _polygon = np.array(_polygon).reshape(-1, 2)
        if len(_polygon) < 3:
            print('Invalid polygon:', _polygon)
            continue
        _polygon = (_polygon * scale).reshape(-1).tolist()
        draw.polygon(_polygon, outline="indigo", fill=fill_color)
        draw.text((_polygon[0] + 8, _polygon[1] + 2), label, fill="indigo")

display(output_image)
Explanation
- Here, we start by importing various tools from the PIL library for image processing.
- We create a deep copy of our image and store the value of the key “<REFERRING_EXPRESSION_SEGMENTATION>” in a new variable.
- Next, we create an ImageDraw instance by calling the .Draw() method and giving it the copy of the actual image.
- Next, we iterate through the zipped polygons and label values.
- For each set of polygons, we iterate through each individual polygon, named _polygon, and reshape it into an array of (x, y) points.
- A polygon needs at least 3 points to form a closed shape, so we check that _polygon contains at least 3 points.
- Then we draw this _polygon on the copy of the actual image by calling the .polygon() method, giving it the _polygon along with the outline color and the fill color.
- If the Florence-2 Vision Language Model generates a label for the polygon, we draw that text on the copy of the actual image by calling the .text() function and giving it the label.
- Finally, after drawing all the polygons generated by the Florence-2 model, we display the image by calling the display function from the IPython library.
The Florence-2 Vision Language Model successfully understood our query of “two black tires” and inferred that the image contained a vehicle with visible black tires. The model generated polygon representations for these tires, which were masked with a blue color. The model excelled in diverse computer vision tasks due to the strong training data curated by the Microsoft Team.
Conclusion
Florence-2 is a Vision Language Model created and trained from the ground up by the Microsoft team. Unlike many other Vision Language Models, Florence-2 performs a wide range of computer vision tasks, including object detection, image captioning, phrase grounding, OCR, segmentation, and combinations of these. In this guide, we have looked at how to download the Florence-2 large model and how to perform different computer vision tasks by changing the prompt given to Florence-2.
Key Takeaways
- The Florence-2 model comes in two sizes. One is the base variant with 0.23 billion parameters, and the other is the large variant with 0.77 billion parameters.
- The Microsoft team trained the Florence-2 model on the FLD-5B dataset, an image dataset with annotations for many different vision tasks, created by the Microsoft team.
- Florence-2 accepts an image along with a prompt as input, where the prompt defines the type of task the Florence-2 vision model should perform.
- Each task generates a different output, and all these outputs are generated in text format.
- Florence-2 is an open-source model with an MIT license, so it can be used for commercial applications.
Frequently Asked Questions
Q. What is Florence-2?
A. Florence-2 is a Vision Language Model developed by the Microsoft team, released in two sizes: a 0.23B-parameter and a 0.77B-parameter version.
Q. How is AutoProcessor different from AutoTokenizer?
A. AutoTokenizer can only deal with text data, converting text into tokens. AutoProcessor, on the other hand, pre-processes data for multi-modal models, which includes image data as well.
Q. What is FLD-5B?
A. FLD-5B is an image dataset curated by the Microsoft team. It contains about 5.4 billion text annotations on 126 million images.
Q. What does the Florence-2 model output?
A. The Florence-2 model outputs text based on the given input image and input text. This text can be a simple image caption, or it can be bounding box coordinates if the task is object detection or segmentation.
Q. Is Florence-2 open source?
A. Yes. Florence-2 is released under the MIT License, making it open source, and one does not need to authenticate with HuggingFace to work with this model.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
By Analytics Vidhya, July 23, 2024.