Hey guys, beginner's question:
I am preparing a dataframe for a machine learning model. The purpose of the model is to predict whether people infected with COVID will die or not.
To do this, I am looking at some conditions and symptoms, such as sore throat, cough, comorbidities, gender, and others, and binarizing them as “yes”/“no” or “male”/“female”.
I have a problem. One of the variables is “pregnant”, but only individuals of the female sex can be pregnant. How can I deal with this variable?
Can I keep it in the dataframe and assign the value “not pregnant” to all male individuals? Or could this harm the model?
It likely won’t matter. Most models (I’m guessing you’re using something like XGBoost) deal robustly with this kind of dependency between features, so keeping the column and marking all males as “not pregnant” is fine.
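For what it's worth, here is a minimal sketch of that encoding in pandas. The column names (sex, pregnant) and the toy values are assumptions on my part, since I haven't seen your actual dataframe, so adjust them to match.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "pregnant": [None, "yes", "no", None],  # missing for male rows
})

# 1 for female, 0 for male.
df["is_female"] = (df["sex"] == "female").astype(int)

# Only an explicit "yes" becomes 1; "no" and missing values become 0,
# so every male row ends up encoded as not pregnant.
df["is_pregnant"] = (df["pregnant"] == "yes").astype(int)
```

Both columns can then go into the model as ordinary binary features.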
If you like, you can combine the two into a single variable (0 for male, 1 for female, 2 for pregnant female), which may slightly improve performance. This assumes every record in the dataset fits that scheme (a pregnant trans man recorded as male, for example, would not). A tree-based model could then draw a boundary between 0 and 1 to split on gender, or between 1 and 2 to split on pregnancy.
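A rough sketch of the combined variable, again assuming hypothetical column names (sex, pregnant) and a helper name I made up (combine_sex_pregnancy). It raises an error if any row breaks the 0/1/2 scheme instead of silently mislabeling it.

```python
import pandas as pd

def combine_sex_pregnancy(frame: pd.DataFrame) -> pd.Series:
    """Return 0 for male, 1 for non-pregnant female, 2 for pregnant female."""
    is_female = frame["sex"] == "female"
    is_pregnant = frame["pregnant"] == "yes"

    # The 0/1/2 scheme has no slot for a pregnant row not recorded as female.
    if (is_pregnant & ~is_female).any():
        raise ValueError("found pregnant rows not recorded as female")

    # 0 + 0 = 0 (male), 1 + 0 = 1 (female), 1 + 1 = 2 (pregnant female).
    return is_female.astype(int) + is_pregnant.astype(int)

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "pregnant": [None, "yes", "no", None],
})
df["sex_pregnancy"] = combine_sex_pregnancy(df)
```

Whether this beats two separate binary columns is something you'd have to check on your own data, e.g. with cross-validation.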