I was recently working on an ML project and realized that I need a very specific type of data for what I am trying to achieve. That makes me wonder: how do people and companies deal with data scarcity? I am pretty sure they, at some point, need a very specific type of data that isn’t easily available.

  • evanthebouncy@alien.topB

    hi, data monger here :D. I’m assuming you need to procure data rather than use synthetic data or data that already exists (internally or online).

    first step is having good judgement about what kind of ML model you would use (or whether ML is the right approach to begin with). for instance, simple tasks can be done with decision trees and nearest-neighbor, while more complex tasks might require fine-tuning an existing model you pull from huggingface. what’s the simplest one for the job that you can get away with? (tiny sketch below)
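
    to make that concrete, here is a minimal sketch of the “start simple” check, assuming a small tabular problem; the stand-in dataset and the two baselines are just for illustration, not a prescription:

        # cross-validate two simple baselines on a small stand-in dataset
        # before reaching for anything heavier. the data is synthetic and
        # only illustrative.
        from sklearn.datasets import make_classification
        from sklearn.model_selection import cross_val_score
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.tree import DecisionTreeClassifier

        # stand-in for a small, hard-to-collect dataset (~200 labeled points)
        X, y = make_classification(n_samples=200, n_features=10, random_state=0)

        for name, model in [
            ("decision tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
            ("nearest neighbor", KNeighborsClassifier(n_neighbors=5)),
        ]:
            scores = cross_val_score(model, X, y, cv=5)
            print(f"{name}: mean accuracy {scores.mean():.2f}")

    if a baseline like this is already close to good enough, you’ve just saved yourself a lot of labeling budget.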

    once the choice of model is settled, you must estimate the quantity of data you need. a simpler model can be fitted with 200 data points; an expressive model might need 100k-ish tokens. so how much does it cost to get one data point? this is the crux of your work as a data curator.

    to answer that, you need to create an annotation interface. the annotators could be in-house, contractors, or crowd-sourced via a website. you’ll wind up spending a good chunk of time getting the UX smooth. it does NOT need to look pretty (you’re paying people to do work with this interface), but it NEEDS to be clear, and at no point should an annotator get “lost” about what to do. every design decision you make in your annotation UX translates multiplicatively into your overall labeling cost. so iterate relentlessly: start with 2 participants, watch over them, then 4, 8, etc.
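
    for a sense of scale, the simplest possible version of such an interface is just a script that shows one item and records one label; the file names and label set below are placeholders, not a real tool:

        # bare-bones command-line annotation loop: show one item, record one
        # label. input/output file names and the label set are hypothetical.
        import csv

        LABELS = {"1": "positive", "2": "negative", "3": "unclear"}

        with open("unlabeled.txt") as f:          # one text example per line
            examples = [line.strip() for line in f if line.strip()]

        with open("labels.csv", "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["text", "label"])
            for i, text in enumerate(examples, 1):
                print(f"\n[{i}/{len(examples)}] {text}")
                choice = input("label? 1=positive 2=negative 3=unclear: ").strip()
                writer.writerow([text, LABELS.get(choice, "unclear")])

    anything beyond this (a web UI, hotkeys, progress tracking) exists to shave seconds off each label, and those seconds are exactly what the pilot is meant to measure.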

    Once your UX is ironed out and you’ve pilot-tested it on a few participants, you will have an _accurate cost guesstimate_: take the average time (and therefore cost) it took to procure each data point in the pilot and multiply it by the number of data points you estimated above. With this estimate, you will (as the competent ML person you are) have a sense of how well your ML model will perform once trained on a comparatively larger set of data of this quality. You’ll get a number; it could be $1,000 or $100k. Then you need to figure out the finances to finally get the dataset out of the way.
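
    the arithmetic behind that guesstimate is trivial; a toy version where every number is made up:

        # toy cost guesstimate from a pilot study; all numbers are made up.
        seconds_per_label = 45        # average time per data point in the pilot
        hourly_rate = 18.0            # what an annotator costs per hour (USD)
        points_needed = 50_000        # quantity estimated from the model choice

        hours = points_needed * seconds_per_label / 3600
        cost = hours * hourly_rate
        print(f"~{hours:.0f} annotator-hours, roughly ${cost:,.0f}")
        # ~625 annotator-hours, roughly $11,250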

    hope this helped! it is very “dirty” work but extremely powerful if done right.

  • Ok_Reality2341@alien.topB

    Your problem is that you are surrounded by other ML engineers.

    Every bit of data is quickly fed to a neural network; new problems require new data.

    The best advice I can give is to join an industry where ML engineers are rare; you’ll have first pick of all the data.

  • Unlikely-Loan-4175@alien.topB

    It’s incredibly common. Unfortunately, it’s often glossed over, or a non-ML solution is used but labeled as ML. Recently there are sometimes good solutions available where a foundation model already exists and you can do few-shot learning to calibrate it with small data. (rough sketch below)
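
    As a rough illustration of what that can look like (the model name and the tiny label set below are assumptions, not a recommendation): embed a handful of labeled examples with a pretrained encoder and classify new text by the nearest class centroid.

        # few-shot text classification on top of a pretrained embedding model.
        # the model name and the tiny labeled "support set" are placeholders.
        import numpy as np
        from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

        model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

        # a handful of labeled examples per class ("few-shot" support set)
        support = {
            "billing": ["I was charged twice", "please refund my last payment"],
            "technical": ["the app crashes on startup", "login page shows an error"],
        }

        # one centroid embedding per class
        centroids = {label: model.encode(texts).mean(axis=0) for label, texts in support.items()}

        def classify(text: str) -> str:
            """Return the label whose centroid is closest (cosine) to the text."""
            v = model.encode(text)
            def cos(a, b):
                return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            return max(centroids, key=lambda label: cos(v, centroids[label]))

        print(classify("why did my card get billed again?"))  # expected: billing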

  • coconutpie47@alien.topB

    You learn about the business, AKA domain knowledge. Plus, when you work with little data, the CEO loves it when you give them a simple, explainable model (even a decision tree) rather than a fancy black box.
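
    For instance (toy dataset, just to show the shape of the deliverable), a shallow decision tree can be handed over as a short list of readable rules:

        # a shallow, explainable decision tree; the printed rules can be read
        # aloud to non-technical stakeholders. iris is only a stand-in dataset.
        from sklearn.datasets import load_iris
        from sklearn.tree import DecisionTreeClassifier, export_text

        data = load_iris()
        tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

        # human-readable if/else rules instead of a black box
        print(export_text(tree, feature_names=list(data.feature_names)))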

  • ai_hero@alien.topB

    IMO, this is the whole reason people use Bayesian methods in the first place: they solve the cold-start problem of building a model when you don’t have data to build a model. (toy example below)
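
    A toy illustration of that idea (the numbers are made up): with a Beta prior encoding a domain guess about a conversion rate, you get a usable estimate before any data arrive and update it as the first observations come in.

        # Beta-Binomial sketch of the cold-start idea: the prior alone gives an
        # estimate with zero data, and each observation updates it.
        from scipy import stats

        # prior belief: conversion rate is probably around 10% -> Beta(2, 18)
        alpha, beta = 2, 18
        print("prior mean:", alpha / (alpha + beta))              # 0.1

        # first handful of real observations: 3 conversions in 20 trials
        conversions, trials = 3, 20
        alpha += conversions
        beta += trials - conversions

        posterior = stats.beta(alpha, beta)
        print("posterior mean:", round(posterior.mean(), 3))      # 0.125
        print("95% credible interval:", [round(x, 3) for x in posterior.interval(0.95)])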

  • Exciting-Engineer646@alien.topB

    Instrumentation, user studies, and being very smart about model choice (can we use an unsupervised model rather than a supervised one? see the sketch below).
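
    A small sketch of the “unsupervised instead of supervised” point (synthetic stand-in data, purely illustrative): an anomaly detector needs no labels at all, which sidesteps the scarce-label problem for some tasks.

        # when labels are scarce, an unsupervised detector can sometimes stand
        # in for a supervised classifier; isolation forest flags outliers with
        # no labels at all. the data here is a synthetic stand-in.
        import numpy as np
        from sklearn.ensemble import IsolationForest

        rng = np.random.default_rng(0)
        normal = rng.normal(0, 1, size=(500, 4))     # typical behaviour
        unusual = rng.normal(6, 1, size=(5, 4))      # a few odd events
        X = np.vstack([normal, unusual])

        detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
        flags = detector.predict(X)                  # -1 = anomaly, 1 = normal
        print("flagged as anomalies:", int((flags == -1).sum()))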

    Synthetic or partially synthetic data is also helpful, particularly for large models.

  • skiddadle400@alien.topB

    Ok, there are a lot of people suggesting synthetic data. This strikes me as very odd. I’ve never done it and never seen it done successfully.* If there isn’t enough data from your process, you don’t know the process well enough and there are other things you should do first. If this is high value, then use Bayesian methods, causal methods, or logistic regressions to try to limit your risk of being wrong by understanding which bits of your data you really depend on (small sketch below).
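
    A small sketch of that last point (made-up feature names, synthetic data): a regularized logistic regression makes it cheap to see which inputs the model actually leans on.

        # inspect which features a regularized logistic regression depends on.
        # feature names and data are made up for illustration.
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                                   n_redundant=0, random_state=0)
        features = ["price", "tenure", "usage", "region", "channel", "age"]

        pipe = make_pipeline(StandardScaler(), LogisticRegression(C=0.5)).fit(X, y)
        coefs = pipe.named_steps["logisticregression"].coef_[0]

        # sort by absolute weight: the inputs the model really depends on come first
        for name, weight in sorted(zip(features, coefs), key=lambda t: -abs(t[1])):
            print(f"{name:>8}: {weight:+.2f}")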

    If you have enough modeling capacity to make synthetic data, you can just build the actual model. After all, it is a model that generates the synthetic data.

    *I joined a project where a new energy market product was being introduced. The regulator had put out expected prices as a synthetic time series, my employer had made big investment decisions off of it, and I was then tasked with improving the trading logic because it wasn’t making money. Turns out the synthetic data had zip all to do with the real prices.

    • Grouchy-Friend4235@alien.topB

      This is NOT the way to use synthetic data, obviously. It takes a rather high degree of incompetence to make money decisions off of some generated reality.

      Then again we’ll see lots of that happening soon :)

    • skiddadle400@alien.topB

      I like how the question asks how it is done in industry. Someone from industry answers and gets downvoted, because the people saying “in my pet project I do this…” don’t like how things are done in industry.