>>14315879
It's funny how many new hot-shot comp-sci people come to intern at our company and go
>oh yeah I built some models with that data and no matter how I tweaked the model it sucks, the data is useless
>wait what do you MEAN the data has errors? I looked, there are no NAs!
ML isn't about knowing how to model ideal data, a toddler could run machine learning on a good dataset.
It's about domain knowledge of datasets: what to expect from datasets in your field, and how to wrestle data clean (e.g., I just KNOW that if a collaborator sends us an SDF/SMILES file, there's a 50/50 chance they accidentally covalently bonded all the salts to the molecules, and it's probably full of mixtures and valence issues).
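To make that concrete: not my actual pipeline, just a minimal sketch of the triage I mean, using RDKit (the file name is a placeholder). Note none of these errors would ever show up as an NA:

```python
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

remover = SaltRemover()  # RDKit's default salt definitions

# Triage a collaborator's SDF before it gets anywhere near a model
# ("collaborator.sdf" is a placeholder path).
for i, mol in enumerate(Chem.SDMolSupplier("collaborator.sdf")):
    if mol is None:
        # RDKit sanitization failed -- usually a valence or parse error
        print(f"record {i}: failed sanitization")
        continue
    frags = Chem.GetMolFrags(mol)
    if len(frags) > 1:
        # disconnected fragments: a mixture, or a salt nobody stripped
        print(f"record {i}: {len(frags)} fragments: {Chem.MolToSmiles(mol)}")
        mol = remover.StripMol(mol)
    # NB: a salt that got covalently bonded to the parent parses as ONE
    # fragment, so this check never fires on it -- which is exactly why
    # you need eyes and domain knowledge, not just a script
```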
The modeling is a single button click for me that runs all of sklearn's algorithms + xgboost and a dozen or so pytorch models through an automatic nested 5-fold CV hyperparameter search. That took a week to set up and runs in a few hours on our biggest dataset, a few minutes on the smaller ones. There's literally no need to think about the models themselves, just grab the best ones (in our domain it's almost always SVC/random forest, probably because 99% of our feature vector is a sparse bit vector, and I'm talking <2% set bits on average across 1000+ features). All of the focus is on how you treat the data, and what data you CAN get. It's no secret that the only real correlate of model improvement is whether someone knows their domain; scramble datasets between experts in different fields and they will always build worse models than they do on the data they know intimately.
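For the curious, the skeleton of that button is nothing exotic. A sketch with sklearn/scipy, with made-up shapes and a truncated candidate list (my real grids and estimators differ):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Fake data shaped like ours: 1024-bit fingerprints at ~2% density
X = sparse_random(500, 1024, density=0.02, format="csr", random_state=0)
X.data[:] = 1.0  # binarize -> sparse bit vectors
y = np.random.default_rng(0).integers(0, 2, size=500)

candidates = {
    "svc": (SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", "auto"]}),
    "rf": (RandomForestClassifier(), {"n_estimators": [100, 500]}),
    # ...the rest of sklearn + xgboost get listed the same way
}

for name, (est, grid) in candidates.items():
    inner = GridSearchCV(est, grid, cv=StratifiedKFold(5))        # inner loop: hyperparameter search
    scores = cross_val_score(inner, X, y, cv=StratifiedKFold(5))  # outer loop: unbiased estimate
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Then you just sort by the outer-loop scores and grab the winner; the nesting is only there so the hyperparameter search can't flatter itself.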
It's easy to build models. It's easy to build good models if you have good data.
The real challenge is shit data and shit datasets.
It's funny how many of the companies we contract with have an army of "ML experts", yet I somehow build a better model than their team of 15 because I, you know, know what to do with data.
That said, it's an easy as fuck job.