“Python tool for converting files and office documents to Markdown”
@Migueldeicaza it's a pity it isn't shipped as a Docker image, or did I just not see that in the repo?
@madduci @Migueldeicaza The project seems to be only a couple of weeks old and is open source. Do you wanna add a Dockerfile if you'd like to see one out there?
@djh @Migueldeicaza yes, I was really thinking about that. It's difficult from a mobile phone ATM, but I'll do it as soon as I'm at a PC again
@djh @Migueldeicaza the PR is live on the repository. Locally tested, works like a charm
@Migueldeicaza funny, 10 min ago I was looking for something like that ;)
let's see if it's better than pandoc.
@stereo @Migueldeicaza why would you want to replace pandoc?
@Migueldeicaza eww :/ so it's a https://pypi.org/project/mammoth/ wrapper with OpenAI plugged in basically?
@nina_kali_nina @Migueldeicaza not at all, it’s a wrapper around several libraries, some of which are just open source Office format parsers. unless I missed something when I read the code last week...
I think they do some kind of image interpretation and speech-to-text.
@nina_kali_nina @Migueldeicaza The code is very accessible to read: https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py
@irskep @Migueldeicaza I left my message explicitly after reading the code for the class called DocxConverter. And there's a very similar XlsxConverter nearby. I understand that some people might just want to feed in any file and get Markdown out of it, but if the intended goal is mostly Office documents, then it's a bit of a... disappointment
@nina_kali_nina @Migueldeicaza I should have been clearer, sorry. I meant I didn't see OpenAI "plugged in" anywhere. It's definitely not doing a lot compared to the libraries it wraps, but there is value in presenting a unified interface to many things. And yeah, this value is largely for people shoving data into LLMs. But it's just as useful for search indexing.
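To make the "unified interface" point concrete, here is a rough sketch of the general pattern: one entry point that dispatches on file extension to format-specific converters. This is not markitdown's actual code; `CONVERTERS`, `register`, and `convert_text` are hypothetical names, and the two converters use only the standard library instead of wrapping real Office parsers.

```python
import csv
import io
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: file extension -> function turning raw text into Markdown.
CONVERTERS: Dict[str, Callable[[str], str]] = {}


def register(ext: str):
    """Decorator that registers a converter for one file extension."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        CONVERTERS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def txt_to_markdown(text: str) -> str:
    # Plain text is already valid Markdown; just normalise surrounding whitespace.
    return text.strip() + "\n"


@register(".csv")
def csv_to_markdown(text: str) -> str:
    # Render the CSV as a pipe table, treating the first row as the header.
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines) + "\n"


def convert_text(filename: str, text: str) -> str:
    """Single entry point: pick a converter based on the file extension."""
    ext = Path(filename).suffix.lower()
    try:
        converter = CONVERTERS[ext]
    except KeyError:
        raise ValueError(f"no converter registered for {ext!r}") from None
    return converter(text)
```

A caller only ever touches `convert_text`, so supporting another format means registering one more function, which is roughly the convenience being argued for here, whether the output goes to an LLM or a search index.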
@irskep I'm referring specifically to the readme that asks the users to provide their OpenAI credentials to fully benefit from the tool :) But I do guess there is some value to such a library, it's just... I'm disappointed, as I was hoping to see a library that would handle MS formats natively, and instead it uses 3rd party providers for Microsoft formats.
@Migueldeicaza how is it better than pandoc?