Our team had been working on a Windows application through many months of iterative development and releases. The application interacted with an industrial UPS (Uninterruptible Power Supply) through
The application was released after rigorous in-house testing and four months of beta testing by end users. After the release, a strange problem surfaced: on a few laptops, the application hung at startup. No splash screen was shown, and there was no other visible sign of the application after that.
We started investigating right away, because the issue was critical and blocked the end users' regular work. We tried to reproduce it on the different kinds of laptops available locally, but could not reproduce it on any of them. We then compared the laptop configurations (hardware, OS, installed software, and so on). None of this gave a clue as to why the application hung at startup on only a few machines.
Steps Used for Debugging
To pinpoint the exact line of code, we used the remote debugger provided by Visual Studio. It can be downloaded free of cost and installed on the problematic laptop. Then setting the right permissions and having remote connectivity to the specific laptop having the
Below is the line of code which was hanging on certain laptops.
return (WriteFile(ctx_serial->w_ser.fd, req, req_length, &n_bytes, 0)) ? (ssize_t)n_bytes : -1;
On further investigation, we found that WriteFile did not hang on every port of an affected laptop, only on certain serial ports. When the application connected to the right port, it never hung.
Our first attempted fix was to check whether the problem lay in synchronous versus asynchronous serial port reads and writes, by opening the port with FILE_FLAG_OVERLAPPED. This approach didn’t work very well.
After spending more time closely debugging the code, we found the real problem: the read/write timeouts were not set for the specific serial port that hung.
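One way to confirm this kind of diagnosis is to read back the timeouts that are in effect on the open port. The sketch below is illustrative only: the `write_timeouts` struct and both function names are hypothetical and not part of the original code. It relies on the documented COMMTIMEOUTS rule that when WriteTotalTimeoutMultiplier and WriteTotalTimeoutConstant are both zero, total timeouts are not used for write operations. The Win32 part is guarded so the check itself can be compiled anywhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, platform-neutral copy of the write fields of COMMTIMEOUTS. */
struct write_timeouts {
    uint32_t total_multiplier; /* WriteTotalTimeoutMultiplier, ms per byte */
    uint32_t total_constant;   /* WriteTotalTimeoutConstant, ms */
};

/* True when no total write timeout is in effect, i.e. both fields are zero. */
static bool write_has_no_timeout(const struct write_timeouts *t)
{
    return t->total_multiplier == 0 && t->total_constant == 0;
}

#ifdef _WIN32
#include <windows.h>

/* Read the write timeouts currently in effect on an open serial port handle.
 * Returns false if GetCommTimeouts fails (see GetLastError() for details). */
static bool query_write_timeouts(HANDLE port, struct write_timeouts *out)
{
    COMMTIMEOUTS ct;
    if (!GetCommTimeouts(port, &ct))
        return false;
    out->total_multiplier = ct.WriteTotalTimeoutMultiplier;
    out->total_constant = ct.WriteTotalTimeoutConstant;
    return true;
}
#endif
```

Logging the result of such a check for each port at startup would make it possible to compare a hanging port with a working one without attaching a debugger.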
Adding the code below after the CreateFile call during initialization solved the issue!
/* Set the serial port timeouts */
COMMTIMEOUTS comm_to;
struct timeval tv;
unsigned int msec = 0;

msec = ctx->response_timeout.tv_sec * 1000 + ctx->response_timeout.tv_usec / 1000;
if (msec < 1)
    msec = 1;

comm_to.ReadIntervalTimeout = msec;
comm_to.ReadTotalTimeoutMultiplier = 0;
comm_to.ReadTotalTimeoutConstant = msec;
comm_to.WriteTotalTimeoutMultiplier = 0;
comm_to.WriteTotalTimeoutConstant = 1000;

// First parameter is the HANDLE to the serial port
SetCommTimeouts(ctx_serial->w_ser.fd, &comm_to);
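The arithmetic in the snippet above can be checked in isolation. The sketch below pulls the conversion into a helper and adds a second helper that follows the documented COMMTIMEOUTS rule for a total timeout: the multiplier times the number of bytes requested, plus the constant. Both function names are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: convert a seconds/microseconds pair to whole
 * milliseconds, never returning less than 1 (same clamp as the fix). */
static uint32_t timeout_to_msec(long tv_sec, long tv_usec)
{
    long msec = tv_sec * 1000 + tv_usec / 1000;
    return (msec < 1) ? 1u : (uint32_t)msec;
}

/* Hypothetical helper: total timeout in ms for a transfer of n_bytes.
 * Note: if multiplier and constant are both 0, the result 0 means
 * "no total timeout is used", not "time out immediately". */
static uint32_t total_timeout_ms(uint32_t multiplier, uint32_t constant,
                                 uint32_t n_bytes)
{
    return multiplier * n_bytes + constant;
}
```

With a response timeout of 0.5 s, for example, `timeout_to_msec(0, 500000)` gives 500 ms for the read constant. Because both multipliers in the fix are 0, the write limit stays at 1000 ms regardless of the request length.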
Lessons Learned
- Always set timeouts for any read/write operation with an external device.
- Add informational logs before and after critical operations in an application. This is obvious and part of programming best practice, but it is often missed. Logs at the critical points of an application are always handy for hard-to-reproduce issues.
- The solution to a problem is not always in the code where the issue shows up. The cause could be a missed initialization, a wrong parameter, or something similar. This principle is often overlooked when a developer tries to fix things by hacking at the code where the failure occurs.
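The logging lesson can be sketched as a small wrapper around any blocking write. This is a minimal illustration, not the application's actual logging code: `write_fn`, `logged_write`, and `fake_write` are hypothetical names, and `fake_write` merely stands in for a real call such as the WriteFile wrapper shown earlier.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical write callback: returns bytes written, or -1 on error. */
typedef long (*write_fn)(const void *buf, unsigned long len, void *user);

/* Log a line before and after the write. The log is flushed before the
 * call so the "start" line is not lost if the write never returns. */
static long logged_write(FILE *log, const char *port_name, write_fn do_write,
                         const void *buf, unsigned long len, void *user)
{
    fprintf(log, "[%ld] write start: port=%s len=%lu\n",
            (long)time(NULL), port_name, len);
    fflush(log);

    long n = do_write(buf, len, user);

    fprintf(log, "[%ld] write end: port=%s result=%ld\n",
            (long)time(NULL), port_name, n);
    fflush(log);
    return n;
}

/* Stand-in for a real device write; pretends every byte was sent. */
static long fake_write(const void *buf, unsigned long len, void *user)
{
    (void)buf;
    (void)user;
    return (long)len;
}
```

A "write start" line with no matching "write end" line in a log collected from an end user's machine would point at the hanging call directly, without needing a remote debugging session.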