I have a database which is essentially a survey tool where admins will define a survey and distribute this to a number of users who will then provide the answers to the survey questions.
The surveys are all independent of each other and fairly random… there's no real theme to them beyond the industry the tool serves.
There are a fairly large number of questions/answers in the DB (in the millions). What would be an interesting ML exercise to run on the data for a complete ML novice (but competent coder)?
I would take it as an opportunity to practice your EDA first. I bet a ton of this data is straight garbage. Maybe experiment with ML-powered ways of cleaning it up if you want to make the work more exciting.
Until you know the data quality, it's GIGO (garbage in, garbage out).
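To make the EDA suggestion concrete, here's a minimal sketch of the kind of quality checks I mean, assuming a hypothetical flat export of the answers table — the column names and sample rows are made up for illustration, not taken from your schema:

```python
from collections import Counter

# Hypothetical flat export of the answers table; in practice you'd
# pull these rows from the survey DB. Names here are illustrative only.
rows = [
    {"survey_id": 1, "question_id": 10, "answer": "Yes"},
    {"survey_id": 1, "question_id": 10, "answer": "yes "},
    {"survey_id": 1, "question_id": 11, "answer": ""},
    {"survey_id": 2, "question_id": 20, "answer": None},
    {"survey_id": 2, "question_id": 20, "answer": "N/A"},
    {"survey_id": 2, "question_id": 21, "answer": "42"},
]

def quality_report(rows):
    """Basic checks: missing answers, placeholder junk, near-duplicates."""
    placeholders = {"", "n/a", "na", "none", "-"}
    missing = sum(1 for r in rows if r["answer"] is None)
    junk = sum(
        1 for r in rows
        if r["answer"] is not None
        and r["answer"].strip().lower() in placeholders
    )
    # Count answers per (survey, question) after normalising
    # whitespace and case, to spot duplicate submissions.
    normalised = Counter(
        (r["survey_id"], r["question_id"], r["answer"].strip().lower())
        for r in rows if r["answer"]
    )
    dupes = sum(c - 1 for c in normalised.values() if c > 1)
    return {"total": len(rows), "missing": missing,
            "placeholder": junk, "duplicates": dupes}

report = quality_report(rows)
print(report)  # e.g. {'total': 6, 'missing': 1, 'placeholder': 2, 'duplicates': 1}
```

Running something like this over the real table tells you quickly whether the answers are clean enough to model, or whether the first "interesting ML exercise" is actually the cleanup itself.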
Hi…I wouldn’t call it garbage, but it is only relevant to the context for which it was created.
And as we are data processors, it needs no cleaning to speak of. The answers provided are appropriate to the needs of the survey creators.