• 10 Posts
  • 98 Comments
Joined 2 years ago
Cake day: June 29th, 2023


  • This is true only if the decisions are made independently. If you let people decide after they’ve seen the metrics, it no longer holds.

    Here’s an example of the first. You go to a farmers’ market with a cow and ask everyone to write on a piece of paper what they think its weight is. If you collect the replies and average them, you’ll find the mean of all the answers is quite close to the real weight. A mix of non-experts and experts irons out a good answer somehow.

    Now take the average experience of going to a restaurant. One might have opened just recently, with great food and great staff, but only 5 reviews, averaging 3.8 or so. Another restaurant nearby has been open for 3-4 years and has 1000 reviews at maybe 3.9. People will usually pick the one with more reviews because the extra information makes it feel like the safer option. But if you hid the review counts and asked them to choose just by looking at the venue and the menu, they would probably pick the first one.

    Group dynamics are quite interesting, and the psychology behind this is quite funky sometimes :D There’s a rough simulation of the independence effect below.
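
    To make the independence point concrete, here’s a minimal sketch in Python (all the numbers are made up): independent guesses average out close to the true weight, while guesses anchored to the running crowd average stay dragged toward whatever the first few people said.

    ```python
    import random

    TRUE_WEIGHT = 650  # hypothetical cow weight in kg
    N = 500

    # Independent guesses: everyone estimates on their own, with their own error.
    independent = [TRUE_WEIGHT + random.gauss(0, 80) for _ in range(N)]

    # Anchored guesses: each person sees the running average of earlier guesses
    # and mostly copies it, adding only a little of their own judgement.
    anchored = [random.gauss(TRUE_WEIGHT * 1.15, 80)]  # the first guess happens to be high
    for _ in range(N - 1):
        crowd_avg = sum(anchored) / len(anchored)
        own_guess = TRUE_WEIGHT + random.gauss(0, 80)
        anchored.append(0.9 * crowd_avg + 0.1 * own_guess)

    print("independent mean:", round(sum(independent) / N))  # lands within a few kg of 650
    print("anchored mean:   ", round(sum(anchored) / N))     # stays noticeably high, pulled by the early guesses
    ```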



  • If you find that OCR doesn’t get you very far, maybe try a small VLM (vision-language model) to parse PNGs of the pages. For example, Nanonets OCR will do this, although it’s quite slow if you don’t have a GPU. It will give you a Markdown version of each page, which you can then translate with another tool.

    PaddleOCR might also be useful, since it focuses on Chinese, although it’s a bit more involved to set up (there’s a rough sketch of what that looks like below). Some other options are MinerU and Mistral OCR (the latter is paid, but you can test it for free if you upload the file to Mistral’s library).
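
    For reference, this is roughly what the PaddleOCR route looks like in Python. It’s a sketch against the classic 2.x API (the file name and `lang` setting are placeholders, and newer 3.x releases have changed the interface somewhat):

    ```python
    # pip install paddlepaddle paddleocr  (CPU build; for GPU use paddlepaddle-gpu)
    from paddleocr import PaddleOCR

    # lang="ch" loads the Chinese + English recognition models;
    # the first run downloads the model weights automatically.
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")

    # Run detection + recognition on one exported page image.
    result = ocr.ocr("page_001.png", cls=True)

    # Each entry is (bounding box, (text, confidence)); print the recognized lines.
    for box, (text, confidence) in result[0]:
        print(f"{confidence:.2f}  {text}")
    ```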



  • You’re right! Sorry for the typo. The older nomic-embed-text model is often used in examples, but granite-embedding is more recent and smaller (30M parameters) for English-only text. If your use case is multilingual, they also offer a bigger one (278M parameters) that handles English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Simplified Chinese. I would test them out a bit to see which works best for you.

    Furthermore, if you’re not dependent on MariaDB for something else in your system, there are other vector databases I would recommend. Qdrant works quite well, and you can integrate it pretty easily into something like LangChain; there’s a rough sketch of a granite-embedding + Qdrant setup below. It really depends on how far you want to push your RAG workflow, but let me know if you have any other questions.
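
    As a starting point, here’s a minimal sketch of that kind of pipeline: embed a few snippets with granite-embedding through the Ollama Python client and store/search them in an in-memory Qdrant instance. The collection name and texts are just placeholders, and the exact client calls can differ a bit between versions of the ollama and qdrant-client packages.

    ```python
    # pip install ollama qdrant-client
    # assumes an Ollama server is running and `ollama pull granite-embedding` has been done
    import ollama
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    docs = [
        "MariaDB recently added vector search support.",
        "Qdrant is a dedicated vector database with payload filtering.",
        "granite-embedding is a 30M-parameter English embedding model.",
    ]

    # Embed all documents in one call; the response holds one vector per input string.
    vectors = ollama.embed(model="granite-embedding", input=docs)["embeddings"]

    # In-memory Qdrant instance, handy for testing before running a real server.
    client = QdrantClient(":memory:")
    client.create_collection(
        collection_name="notes",
        vectors_config=VectorParams(size=len(vectors[0]), distance=Distance.COSINE),
    )
    client.upsert(
        collection_name="notes",
        points=[
            PointStruct(id=i, vector=vec, payload={"text": doc})
            for i, (vec, doc) in enumerate(zip(vectors, docs))
        ],
    )

    # Query: embed the question and fetch the closest stored snippet.
    question = ollama.embed(model="granite-embedding", input=["which database handles vectors?"])
    hits = client.query_points(
        collection_name="notes",
        query=question["embeddings"][0],
        limit=1,
    )
    print(hits.points[0].payload["text"])
    ```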