I’ve been having this issue very sporadically (sometimes a couple times a week, sometimes once a month). I’m curious as to how the more veteran folk here would try and narrow down the cause of this issue.
I can provide more info if needed!
Edit: More Info:
- Using a static IP (no DHCP) through Netplan.
Honestly I would replace Ubuntu with an actual operating system designed for servers.
Tbf, I’m using Ubuntu Server because I’ve been using it on and off for a decade. Overall I’ve had no major issues, and its long-term support is one of the longest.
There’s the thought of going RHEL-based which I have experience with from work too, so that is an option in the long run.
Check if there’s some weirdness in IP allocation. A reboot can cause DHCP to give it a new one that works as long as it does, then fail on some weird collision.
I forgot to mention that I’m not using DHCP; I’m statically allocating a local IP.
Is there a DHCP server at play? Is the static IP outside of the DHCP range? This does sound like a typical IP collision.
DHCP is enabled on the router, but I believe the IP address is outside the designated DHCP range.
I’ll double check when I’m home!
Edit: I will also say that this modem/router is dedicated only to the server, so there shouldn’t be any other clients on it at all.
This might not be applicable to your use case, but maybe it helps.
A couple of years ago I had a problem where ONE Windows laptop was unable to access the internet. Sometimes it would work right away, sometimes it took 1 or 2 reboots, sometimes the damn thing wouldn’t budge.
Lo and behold, it turns out the Windows laptop was assigned a DHCP address that one Linksys router had as a static IP. Why that resulted in a sporadic error and not a constant one I’ll never know.
So next time you have this issue, rip out the network cable from the server and try to ping the ip the server is supposed to have.
Other than that, check the journal to see if something starts to pop up around the time you experience the problem.
Thanks for the suggestion. I have the static IP assigned and DHCP disabled, both through Netplan rather than on the router.
I’ll remember to check the Netplan (?) journal/logs around that time, or are you referring to dmesg?
Since you’re not really sure what the issue is, check all the logfiles around the time the problem starts. Maybe you’ll see a service stopping or starting.
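If it helps, here is a rough sketch of pulling journal entries from a window around the time of an outage. It assumes a systemd machine where `journalctl` is on the PATH; the function names `journal_cmd` and `show_journal` and the 10-minute window are made up for illustration, not anything Ubuntu or Netplan ships.

```python
import subprocess
from datetime import datetime, timedelta


def journal_cmd(when, minutes=10, priority="warning"):
    """Build a journalctl command covering `minutes` before and after `when`.

    `-p warning` limits output to warnings and anything more severe.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    since = (when - timedelta(minutes=minutes)).strftime(fmt)
    until = (when + timedelta(minutes=minutes)).strftime(fmt)
    return [
        "journalctl", "--no-pager",
        "-p", priority,
        "--since", since,
        "--until", until,
    ]


def show_journal(when, minutes=10):
    """Run the command and return whatever journalctl prints (hypothetical helper)."""
    result = subprocess.run(
        journal_cmd(when, minutes),
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

Something like `print(show_journal(datetime(2024, 1, 1, 12, 0)))` would then show warnings and errors logged between 11:50 and 12:10 on that (example) date. Reading the full journal may need root or membership in the `systemd-journal` group.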
Thank you, I’ll do that! It’s hard to catch exactly when it happens. I think I need to get some monitoring and alert services up and running.
Check that nameservers are specified in /etc/resolv.conf.
Good idea. I’ll check that
If ping fails with the name but works with 8.8.8.8, then you’re losing DNS. Set the nameservers locally on the server if you’re using a static IP.
I do have nameservers (8.8.8.8 being one of them) set in my Netplan config, if that counts.
It should, I think (I am not familiar with Netplan). From what I can tell, it’s a utility that simplifies the local config tasks, and when you apply it, I think it should be putting the nameservers in /etc/resolv.conf. But before you go down this rabbit hole, check whether 8.8.8.8 or 1.1.1.1 respond to a ping when pings to named servers don’t. If so, it’s definitely name resolution (which would be my first guess).
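The "IP works but names don’t" test above could be scripted roughly like this, so it can be run the moment the connection drops. This is only a sketch under a few assumptions: it uses a TCP connection to port 53 on 8.8.8.8 in place of an ICMP ping (raw ping sockets usually need root), `example.com` is just a placeholder name, and the function names are invented for the example.

```python
import socket


def can_reach_ip(host="8.8.8.8", port=53, timeout=3.0):
    """True if a TCP connection to a raw IP succeeds (no DNS involved)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_resolve(name="example.com"):
    """True if the system resolver can turn a name into an address."""
    try:
        socket.getaddrinfo(name, 80)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def diagnose(ip_ok, dns_ok):
    """Classify the outage from the two checks."""
    if ip_ok and dns_ok:
        return "ok"
    if ip_ok:
        # Raw IP reachable but names fail: name resolution is the problem.
        return "dns failure"
    # Raw IP unreachable: the link, route, or address itself is the problem.
    return "no connectivity"
```

Running `diagnose(can_reach_ip(), can_resolve())` during an outage gives `"dns failure"` when only name resolution is broken and `"no connectivity"` when even the raw IP is unreachable, which points back at the link or an address collision.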
Thank you, I’ll follow this advice next time it decides to cut the internet off.
I have a TP-Link HS300 that connects to an AP running OpenWrt. At any moment I can WireGuard into the AP and use the Kasa app to turn any machine on or off (not the AP though).
There is a Python package for Kasa, so one can use the OpenWrt machine to detect if a machine is down and automatically power-cycle it if needed.
I was thinking of something like a local script to reboot the PC once internet isn’t reachable after 5 min or so, but only as a stopgap/workaround until I figure out the issue.
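A minimal sketch of that stopgap, assuming the script runs as root on a systemd machine (so `systemctl reboot` works) and that "internet reachable" means a TCP connection to 8.8.8.8 port 53 succeeds. The names `online`, `should_reboot`, and `watch`, the 30-second poll interval, and the 300-second grace period are all placeholders picked for the example.

```python
import socket
import subprocess
import time


def online(host="8.8.8.8", port=53, timeout=3.0):
    """True if a TCP connection to a well-known IP succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_reboot(down_since, now, grace=300.0):
    """True once the link has been down for at least `grace` seconds."""
    return down_since is not None and now - down_since >= grace


def watch(check=online, interval=30.0, grace=300.0):
    """Poll connectivity and reboot after `grace` seconds of continuous failure."""
    down_since = None
    while True:
        now = time.monotonic()
        if check():
            down_since = None      # link is back, reset the timer
        elif down_since is None:
            down_since = now       # first failed check, start the timer
        if should_reboot(down_since, now, grace):
            # Assumption: running as root on a systemd host.
            subprocess.run(["systemctl", "reboot"], check=False)
            return
        time.sleep(interval)
```

Calling `watch()` from a small systemd service would do the job. It might be worth writing a timestamp to a file just before the reboot, otherwise the workaround hides exactly when the outages happen.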