How do you source and license data for AI training and fine-tuning?

#webdev #ai #opensource #database

I've been working on AI projects for a while now, and I keep running into the same problem. I'm wondering whether it's just me or a universal developer experience.

I need specific training and fine-tuning data for a model. Not the usual stuff you find on Kaggle or in other public datasets, but something more niche or specialized, e.g. financial data from a particular sector or medical datasets. I look for quality datasets, but most of the time they are hard to find or license, and they don't meet the quality or requirements I'm looking for.

So, how do you typically handle this? Do you mostly use free/open-source datasets? Do you use synthetic data? Or do you use whatever datasets come close to your needs, even if that may compromise training/fine-tuning?

I'm curious if there is a better way to approach this, or if struggling with data acquisition is just part of every developer's process. Do bigger companies have the same problems in sourcing and finding suitable data?

If you can share any tips on these issues, or your experience finding more suitable datasets, it would be much appreciated!

DEV Community
