I have quite a large dataset of historical games for a sport. Generally speaking, what is the best way to predict the winner of these games?
Currently I have a program that transforms every game into a set of features (participant ages on the day, their wins, stats at the time, etc.) and outputs a binary value indicating whether team 0 or team 1 wins. I guess my questions are:
- Generally speaking, when training a complex model for something like game predictions, where it's hard to determine whether a parameter is particularly useful or not, is it better to just have as many parameters as possible? Or is it possible that too many can be detrimental? For example, I could have a single parameter for "career minutes played". Or would it be more effective to have career minutes played and also career minutes played for every quarter, since players could have varying experience at certain times of the game?
- What kind of model architecture is generally perceived as the best for something like this, where we have 100s of input parameters all boiling down to probabilities for the outcome being 0 or 1? Currently I am trying both random forest classification and feedforward neural nets. If neural networks are the avenue I should pursue, is it generally agreed that bigger is better for FNNs? More hidden layers? Larger hidden layers?
When I do sports analysis, XGBoost, elastic nets, and MARS models are my friends. Stack a few together. Tune them well.
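As a rough illustration of stacking, here is a minimal sketch that assumes scikit-learn is available. It is not the exact setup described above: `GradientBoostingClassifier` stands in for XGBoost, an elastic-net-penalized logistic regression stands in for the elastic net, MARS is left out because scikit-learn does not ship it, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a game-feature matrix (assumption: real data would
# come from your own feature pipeline, with y = 1 when team 1 wins).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# Base learner 1: gradient-boosted trees (stand-in for XGBoost).
boosted = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Base learner 2: logistic regression with an elastic net penalty.
# Scaling matters for penalized linear models, hence the pipeline.
elastic = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0, max_iter=5000
    ),
)

# The final estimator is trained on out-of-fold predictions of the base learners.
stack = StackingClassifier(
    estimators=[("boosted", boosted), ("elastic", elastic)],
    final_estimator=LogisticRegression(),
    cv=5,
)

# Log loss rewards well-calibrated win probabilities, not just correct picks.
scores = cross_val_score(stack, X, y, cv=3, scoring="neg_log_loss")
mean_log_loss = -scores.mean()
print(f"cross-validated log loss: {mean_log_loss:.3f}")
```

The hyperparameters here (`n_estimators`, `l1_ratio`, `C`) are placeholders; tuning them is the part that usually matters most.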
Sports data is usually as structured and clean as anything in the world, so I don’t think a big neural network will be necessary or helpful.
Lastly, I recommend modeling the proportion of points scored by the home team rather than winner/loser as a binary outcome, as this is more informative.
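A small sketch of what that target could look like, using only the standard library. The `Game` class, its field names, and the rule for scoreless games are assumptions made for illustration; if your data has no home/away distinction, treating team 0 as "home" works the same way.

```python
from dataclasses import dataclass


@dataclass
class Game:
    # Hypothetical record layout; adapt to your own schema.
    home_points: int
    away_points: int


def home_points_share(game: Game) -> float:
    """Fraction of all points in the game scored by the home team."""
    total = game.home_points + game.away_points
    if total == 0:
        return 0.5  # assumption: treat a scoreless game as an even split
    return game.home_points / total


def home_wins(share: float) -> bool:
    """Recover the binary outcome from an actual or predicted share."""
    return share > 0.5


games = [Game(102, 98), Game(88, 110), Game(95, 94)]
targets = [home_points_share(g) for g in games]
winners = [home_wins(s) for s in targets]
```

A regression model trained on `targets` sees the difference between a one-point win and a blowout, which a 0/1 label hides; thresholding its prediction at 0.5 still gives a winner.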
I recommend starting with as many variables as you can, fitting your model, and seeing how many variables you can cut out before your cross-validated performance starts dropping substantially.
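One way this could be sketched, assuming scikit-learn and a random forest (one of the models mentioned in the question). The function name `backward_eliminate`, the `tolerance` threshold, and the use of impurity-based importances to pick which variable to drop are all illustrative choices, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: 12 features, only 4 of which carry signal.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=0, random_state=0
)


def backward_eliminate(X, y, tolerance=0.02, cv=3):
    """Drop the least important feature one at a time until the
    cross-validated accuracy falls more than `tolerance` below the
    score obtained with all features."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    kept = list(range(X.shape[1]))
    baseline = cross_val_score(model, X[:, kept], y, cv=cv).mean()
    history = [(len(kept), baseline)]
    while len(kept) > 1:
        # Rank the remaining features by importance on a full fit.
        model.fit(X[:, kept], y)
        weakest = int(np.argmin(model.feature_importances_))
        candidate = kept[:weakest] + kept[weakest + 1:]
        score = cross_val_score(model, X[:, candidate], y, cv=cv).mean()
        if score < baseline - tolerance:
            break  # performance dropped substantially; keep the last good set
        kept = candidate
        history.append((len(kept), score))
    return kept, history


kept, history = backward_eliminate(X, y)
for n_features, score in history:
    print(f"{n_features:2d} features -> CV accuracy {score:.3f}")
```

Because the same cross-validation scores guide the selection, the final number is somewhat optimistic; a held-out set of recent games gives a fairer check.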