I have quite a large dataset of historical games for a sport. Generally speaking, what is the best way to predict the winner of these games?

Currently I have a program that transforms every game into a set of features (participant ages on the day, their wins, stats at the time, etc.), and the target is a binary value indicating whether team 0 or team 1 wins. I guess my questions are:

  1. Generally speaking, when training a complex model for something like game predictions, where it's hard to determine whether a parameter is particularly useful or not, is it better to just have as many parameters as possible? Or is it possible that too many can be detrimental? For example, I could have a single parameter for “career minutes played”. Or would it be more effective to have career minutes played and also career minutes played for every quarter, because players could have varying experience at certain times of the game?

  2. What kind of model architecture is generally perceived as the best for something like this, where we have hundreds of input parameters all boiling down to probabilities for the outcome being 0 or 1? Currently I am trying both random forest classification and feed-forward neural nets (FNNs). If neural networks are the avenue I should pursue, is it generally agreed that bigger is better for FNNs? More hidden layers? Larger hidden layers?

  • Ty4Readin@alien.topB
    1 year ago

    Couple of things to break down here.

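    A small note first: what you call “parameters” are normally called “features”. “Parameters” usually refers to the values a model learns during training, such as the weights of a neural network.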

    Your two questions are pretty similar:

    Q1. Is it better to add more features or fewer features?

    Q2. Is it better to have a larger, more complex model (like a bigger neural network) or a smaller, simpler one?

    The answer to both is: it depends!

    When you add more features or make your model larger and more complex, the model can capture more complex patterns. That can be beneficial, but it can also be harmful!

    You should read up on overfitting vs underfitting error. Generally speaking, you can reduce underfitting error by adding features and increasing model complexity, but that usually comes with the trade-off of increasing overfitting error.

    The question then becomes: does the reduction in underfitting error outweigh the increase in overfitting error?

    The only way to know for sure is usually to test out both approaches on a validation set and choose the model and feature set that performed best.
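
    As a rough illustration of that comparison, here is a minimal sketch that trains the same model on two feature sets and picks the one with the better validation accuracy. Everything in it is an assumption made for the example: the data is synthetic (only 2 of 10 features actually influence the outcome), the names `make_games`, `fit_logistic`, and `feature_sets` are hypothetical, and a small hand-written logistic regression stands in for the random forest or neural network so the snippet has no dependencies.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def make_games(n, rng, n_features=10):
    """Synthetic stand-in for real games: only features 0 and 1 affect the outcome."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(n_features)]
        p_team1 = sigmoid(1.5 * x[0] - 1.0 * x[1])
        X.append(x)
        y.append(1 if rng.random() < p_team1 else 0)
    return X, y

def score(model, x, cols):
    """Linear score using only the feature columns in `cols`."""
    w, b = model
    return b + sum(wi * x[c] for wi, c in zip(w, cols))

def fit_logistic(X, y, cols, epochs=50, lr=0.05):
    """Plain SGD logistic regression restricted to the feature columns in `cols`."""
    w, b = [0.0] * len(cols), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            err = sigmoid(score((w, b), x, cols)) - t
            b -= lr * err
            w = [wi - lr * err * x[c] for wi, c in zip(w, cols)]
    return w, b

def accuracy(model, X, y, cols):
    """Fraction of games where the predicted winner matches the label."""
    hits = sum((score(model, x, cols) > 0) == (t == 1) for x, t in zip(X, y))
    return hits / len(y)

rng = random.Random(0)
X_train, y_train = make_games(60, rng)   # small training set
X_val, y_val = make_games(500, rng)      # held-out validation set

# Two candidate feature sets: a small one and "everything we have".
feature_sets = {"small": [0, 1], "all": list(range(10))}

results = {}
for name, cols in feature_sets.items():
    model = fit_logistic(X_train, y_train, cols)
    results[name] = {
        "train": accuracy(model, X_train, y_train, cols),
        "val": accuracy(model, X_val, y_val, cols),
    }

# Select on validation accuracy, never on training accuracy.
best = max(results, key=lambda name: results[name]["val"])
for name, r in results.items():
    print(f"{name}: train={r['train']:.3f} val={r['val']:.3f}")
print("chosen by validation accuracy:", best)
```

    A large gap between training and validation accuracy for one candidate is the usual sign of overfitting. The same loop works for comparing model sizes: swap the feature sets for a set of candidate models and keep the selection step unchanged.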