How we index images for RAG

192 points by mooreds 1 day ago | 26 comments

hparadiz
With media ingestion this is called "eager" processing. Historically it has been used for things like pulling thumbnails from images / video and pre-generating common sizes. This follows the same pattern and makes all the sense in the world. My only concern is that, due to the non-deterministic nature of LLMs, new models will reveal new information about your data.
For example, you might identify a car in an image, but the context is the car running a red light. A new model might pick that up while an old one doesn't. These context adjustments might sometimes require you to rerun your LLM processing, or potentially to keep a one-to-many relationship for multiple runs so you can take the best one or combine results.
Actual usage will also reveal the most commonly used assets, so you can target the most trafficked ones and save a ton on processing that way.
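A rough sketch of the one-to-many idea, in Python. Everything here is hypothetical (the `DescriptionStore` class, the model names, the access counts); it only illustrates keeping every run per image rather than overwriting, and reprocessing the most requested assets first.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DescriptionRun:
    model: str      # hypothetical model identifier
    run_index: int  # position of this run in the image's history
    text: str       # description produced by that run


class DescriptionStore:
    """Keeps every description run per image instead of overwriting."""

    def __init__(self):
        self._runs = defaultdict(list)

    def add_run(self, image_id, model, text):
        runs = self._runs[image_id]
        run = DescriptionRun(model=model, run_index=len(runs), text=text)
        runs.append(run)
        return run

    def latest(self, image_id):
        # Most recent run for this image
        return self._runs[image_id][-1]

    def combined_text(self, image_id):
        # Join distinct descriptions, oldest first, for indexing
        seen = []
        for run in self._runs[image_id]:
            if run.text not in seen:
                seen.append(run.text)
        return "\n".join(seen)


def reprocessing_order(access_counts, limit):
    """Return up to `limit` image ids, most requested first (ties by id)."""
    ranked = sorted(access_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [image_id for image_id, _ in ranked[:limit]]
```

Whether to index the latest run or the combined text is a judgment call; combining keeps older details that a newer model might happen to drop.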
- giancarlostoro
  Reminds me of my years working on digital forensic software... I was working at a smaller scale, but the idea was similar: extract, carve, pull as many raw files as possible, then run them through various threads / pipelines of processing, then categorize and produce some sort of report. I guess in this case the goal is to get it all buttoned up for training. I also have to imagine some of it goes through some level of human review. Anyone wanting to make a worthwhile model is better off letting humans describe things; my understanding is that the outputs become drastically better. Sure, training can find all the patterns, but the wording used to describe it all makes a difference if you can get just enough detail.
- kaibuidsai
  Preprocessing images with a VLM is an old, common practice.
bad_username
> we don't send images to the model at query time. We describe each image once, at indexing time, with a cheap vision model, store the descriptions as text, and retrieve them alongside ordinary text chunks
This is what I've been doing in my Obsidian infodump for a while. If I know that an image is important, I generate a text description (Mermaid if possible, English if not) and paste it after the image in a block. This lets agents "see" the image even if they don't really see it. Though my process is manual, the improvements in outcomes for agents that rely on text search/retrieval are very real and worth it.
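For anyone who wants to automate the quoted approach, here is a minimal sketch in Python. It is not the article's implementation: `describe_image` is a hypothetical callable standing in for whatever vision model you use, and the keyword-overlap scoring is a toy stand-in for a real retriever. The point is only that images are described once at indexing time and then live in the same index as ordinary text chunks.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str           # "text" or "image_description"
    image_path: str = ""  # set only for image descriptions


def build_index(text_chunks, image_paths, describe_image):
    """Build one flat index of text chunks and image descriptions.

    describe_image is a hypothetical callable (path -> description string).
    It is called once per image, at indexing time only.
    """
    index = [Chunk(text=t, source="text") for t in text_chunks]
    for path in image_paths:
        description = describe_image(path)
        index.append(
            Chunk(text=description, source="image_description", image_path=path)
        )
    return index


def retrieve(index, query, top_k=3):
    """Toy keyword-overlap retrieval over text and descriptions alike."""
    terms = set(query.lower().split())
    scored = []
    for chunk in index:
        overlap = len(terms & set(chunk.text.lower().split()))
        if overlap:
            scored.append((overlap, chunk))
    # Sort on the score only; Python's sort is stable, so ties keep index order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

No image is touched at query time; `retrieve` only ever reads text.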
- vinzenzu
  For a RAG project for a client with a lot of PDFs and Powerpoints with images, I used ColPali a year ago. I see the provider ColiVara is still online but it seems to have fizzled out.
  Retrieving based on text and then giving the generation model the image instead is much smarter than retrieving based on image. Image-based retrieval is slow and expensive.
  Same with giving the model an image vs a structured representation of it.
  throwaw12
  > For a RAG project for a client with a lot of PDFs and Powerpoints with images, I used ColPali a year ago
  How was the accuracy compared to pre-parsing the image and doing search in the text?
  vinzenzu
  Leaps and bounds better! I don't think I benchmarked it.
  But the experience was that it was able to find small details in PDFs, in technical diagrams, and this was really not captured well at all with OCR.
  In general, OCR I think should be used more as an add-on to retrieve data, not given to the generation model itself. Similar to retrieving based off a text description and then giving the generation model the image.
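  A small sketch of that last step, in Python: retrieve on text, then hand the generation model the image itself rather than its description. The `Hit` type and the returned dictionary shape are hypothetical, not any particular provider's API.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    image_path: str = ""  # empty for plain text chunks


def build_generation_inputs(hits):
    """Split retrieved hits into text context and images to attach.

    Hits that came from an image description contribute the image itself,
    not the description, so the generator sees the original pixels.
    """
    context, images = [], []
    for hit in hits:
        if hit.image_path:
            if hit.image_path not in images:  # attach each image once
                images.append(hit.image_path)
        else:
            context.append(hit.text)
    return {"context": "\n\n".join(context), "images": images}
```

  The description (or OCR text) is used only to find the image; what the generation model receives is the image.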
- Terretta
  What does Mermaid text description of an image mean?
  Descriptions of images that are charts or diagrams to start with?
  bad_username
  Most diagrams I come across are basically boxes and arrows, which can be represented as Mermaid flow charts without losing information. The layout of the Mermaid rendering will usually look different, but that is not typically what matters. ChatGPT is quite good at creating Mermaid flow charts from random box-and-arrow diagram images.
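  To make the "boxes and arrows" point concrete, here is a sketch in Python that turns a node/edge list into Mermaid flowchart text. The function name and the input shapes are made up for illustration; the output uses standard Mermaid `flowchart` syntax.

```python
def to_mermaid(nodes, edges, direction="TD"):
    """Render boxes and arrows as Mermaid flowchart text.

    nodes: dict mapping node id -> box label
    edges: list of (source id, target id, arrow label); label may be ""
    """
    lines = [f"flowchart {direction}"]
    for node_id, label in nodes.items():
        lines.append(f'    {node_id}["{label}"]')
    for src, dst, label in edges:
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {src} {arrow} {dst}")
    return "\n".join(lines)


print(to_mermaid(
    {"A": "Client", "B": "Server"},
    [("A", "B", "request")],
))
```

  The boxes, labels, and connections survive as plain text that search and agents can read; only the visual layout is lost.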
  gatlin
  Which cheap vision model would you recommend for ingesting category diagrams and producing mermaid facsimiles?
m4rkuskk
It's an ad for their product. There is nothing special about this approach, and it is probably done the same way by everyone else.
furyman
Well, I don't know whether others have been doing this too, but I have been doing it for two years and it works really well. The one exception: for the documents I had to chunk that contained these images, I had to chase the authors (multiple of them) to update the relevant captions for their images. It is more cost-efficient than multimodal, with less ingestion time overall. The only catch is that if the retrieval query is a question that can be answered only by looking at the image, this architecture would need some modification.
fhouser
That's smart. Just the other day, I was thinking about how I would solve images/graphs/rich PDF stuff in a RAG system. Now I know more, thanks!
- emil_sorensen
  happy it helped!
vessenes
"This is what makes the load-bearing case work,"
Man I hate that AI writing tic. I appreciate the instincts for sharing the workflow. It's still very difficult to get AI to put an info-dense description together though; we tend to get long and vague ones.
relevant_stats
Seriously?
- Marketing material? check
- Bloated to the extreme? check
- "Get a free trial" at the end? check
- Entirely LLM generated? check
EGreg
We have it in our open-source framework, in case anyone wants to deploy it:
https://github.com/Qbix/AI/blob/6753f6e453908682401f49760002...
https://github.com/Qbix/AI/blob/main/config/observations.jso...
wrote it up here a few months ago: https://community.safebots.ai/t/building-cultural-infrastruc...
383toast
why not a multimodal embedding model?
- efavdb
  Article says this misses important details, eg data that might be in the image.
  breadislove
  very bad take. with most modern multimodal models you get way better performance than going to text first
  emil_sorensen
  it's a cost/latency trade-off in production + very use-case dependent
- sateesh
  The article does mention why they don't use multimodal retrieval. I also think this approach is cheaper (compute-wise) than multimodal retrieval. From the article:
  > Multimodal retrieval does not suit this domain. CLIP-style embeddings wash out exactly the fine detail that matters in charts, tables, and annotated screenshots, and short technical queries ("how do I configure X") give too little signal to match against image vectors
iot_devs
How descriptive is the caption that you obtain?
So you include colour, shapes, etc?
hparadiz
That cookie popup just makes me wanna leave and never come back
- dang
  I think they've fixed it now.
  emil_sorensen
  Thanks! Yep fixed