After many months of iterative development and releases, our team was working on a Windows application that communicates with an industrial UPS (Uninterruptible Power Supply) over a serial port. The application helps field engineers deploy the UPS, carry out maintenance, configure, measure, and calibrate it, and upgrade its firmware. Because the application is critical, any blocking issue needs to be addressed as soon as possible so that field engineers can carry on with their jobs.

Problem Statement

The application was released after rigorous testing, including four months of beta testing by end users. After the release, a strange problem popped up: on a few laptops, the application hung at startup. No splash screen was shown, and there was no other visible sign of the application after that.

We started investigating right away, as the issue was blocking the end users' regular work. We tried to reproduce it on the different kinds of laptops available locally, but it did not occur on any of them. We then compared the laptop configurations (hardware, OS, installed software, and so on). None of this gave a clue as to why the application was hanging at startup on a few machines.

Steps Used for Debugging

We added verbose and informational logs to see at what point the application was hanging. Running a new test release with the additional logs showed that the application hung while trying to write to a serial port. This was part of the initialization process, in which the application automatically runs through all available ports to find the serial port to which the UPS is connected.
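The kind of logging that narrows down a hang like this can be sketched as follows. This is only an illustration, not the code from our application: log_step and LOG_AROUND are hypothetical names, and the log format is an assumption. The important detail is the flush after every line, so the last entry reaches the file even if the next call never returns.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not from the original application): writes one
 * timestamped line and flushes it immediately, so the line is not lost
 * in a buffer if the very next call blocks forever. */
static int log_step(FILE *out, const char *phase, const char *what, int port)
{
    assert(out != NULL);
    int n = fprintf(out, "[%ld] %s %s COM%d\n",
                    (long)time(NULL), phase, what, port);
    fflush(out);
    return n;
}

/* Brackets a potentially blocking call with BEGIN/END lines. A BEGIN
 * line without a matching END in the log points at the call that hung. */
#define LOG_AROUND(out, what, port, call)          \
    do {                                           \
        log_step((out), "BEGIN", (what), (port));  \
        (void)(call);                              \
        log_step((out), "END", (what), (port));    \
    } while (0)
```

Wrapping each step of the port scan this way, for example LOG_AROUND(logfile, "write", 3, probe_port(3)) with a hypothetical probe_port function, leaves a dangling BEGIN line for the port on which the application is stuck.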

To pinpoint the exact line of code, we used the remote debugger provided by Visual Studio. It can be downloaded free of charge and installed on the problematic laptop. With the right permissions set and network connectivity to the affected laptop, the application can then be debugged remotely.

Below is the line of code which was hanging on certain laptops.

return (WriteFile(ctx_serial->w_ser.fd, req, req_length, &n_bytes, 0)) ? (ssize_t)n_bytes : -1;

On further investigation, we found that not every port on an affected laptop had a problem with the WriteFile API; only certain serial ports did. When the application did connect to the right port, it never hung.

Solution

The first approach was to check whether the problem lay in synchronous versus asynchronous (overlapped) serial port I/O, by opening the port with FILE_FLAG_OVERLAPPED. This approach didn’t work very well.

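After spending a bit more time closely debugging the code, we found the actual problem: the read/write timeouts were not set for the specific serial port that had the issue. When no write timeout is in effect, a synchronous WriteFile call can block indefinitely if the port never accepts the data.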

Adding the code below after the CreateFile call during initialization solved the issue!

/* Set the serial port timeouts so that ReadFile/WriteFile cannot block indefinitely */
COMMTIMEOUTS comm_to;
unsigned int msec = 0;

/* Convert the configured response timeout to milliseconds (at least 1 ms) */
msec = ctx->response_timeout.tv_sec * 1000 + ctx->response_timeout.tv_usec / 1000;
if (msec < 1)
	msec = 1;

comm_to.ReadIntervalTimeout = msec;
comm_to.ReadTotalTimeoutMultiplier = 0;
comm_to.ReadTotalTimeoutConstant = msec;
comm_to.WriteTotalTimeoutMultiplier = 0;
comm_to.WriteTotalTimeoutConstant = 1000; /* give up on a write after 1 second */
/* First parameter is the HANDLE to the serial port; the call returns zero on failure */
SetCommTimeouts(ctx_serial->w_ser.fd, &comm_to);
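For readers who want to try the idea in isolation, here is a self-contained sketch of the same timeout setup. It is an illustration under stated assumptions, not our production code: timeout_msec, fill_comm_timeouts and apply_comm_timeouts are hypothetical helper names, and the stand-in type definitions exist only so the calculation can be compiled and checked on a non-Windows machine. On Windows the real definitions come from windows.h.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-ins so the sketch compiles off Windows (assumption for
 * illustration only; field order mirrors the Win32 COMMTIMEOUTS). */
typedef unsigned long DWORD;
typedef struct {
    DWORD ReadIntervalTimeout;
    DWORD ReadTotalTimeoutMultiplier;
    DWORD ReadTotalTimeoutConstant;
    DWORD WriteTotalTimeoutMultiplier;
    DWORD WriteTotalTimeoutConstant;
} COMMTIMEOUTS;
#endif

/* Converts seconds + microseconds to milliseconds, clamped to at least 1.
 * A total-timeout constant of 0 together with a multiplier of 0 means
 * "total timeouts are not used", which is exactly what we want to avoid. */
static DWORD timeout_msec(long sec, long usec)
{
    long msec = sec * 1000 + usec / 1000;
    return (DWORD)(msec < 1 ? 1 : msec);
}

/* Fills every field explicitly so nothing is left over from the driver
 * defaults or from a previous user of the port. */
static void fill_comm_timeouts(COMMTIMEOUTS *to, DWORD read_msec, DWORD write_msec)
{
    assert(to != NULL);
    to->ReadIntervalTimeout = read_msec;
    to->ReadTotalTimeoutMultiplier = 0;
    to->ReadTotalTimeoutConstant = read_msec;
    to->WriteTotalTimeoutMultiplier = 0;
    to->WriteTotalTimeoutConstant = write_msec;
}

#ifdef _WIN32
/* Applies the timeouts to an open port handle. Returns 1 on success and
 * 0 on failure; call GetLastError() for the reason. */
static int apply_comm_timeouts(HANDLE port, const COMMTIMEOUTS *to)
{
    COMMTIMEOUTS copy = *to; /* SetCommTimeouts takes a non-const pointer */
    return SetCommTimeouts(port, &copy) ? 1 : 0;
}
#endif
```

Checking the result of SetCommTimeouts is worth the extra lines: if the call fails silently, the port keeps whatever timeouts it had before, and the hang can come back on exactly the machines where it is hardest to debug.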

Lessons Learned

  1. Always set timeouts for any read/write operation with an external device.
  2. Add informational logs before and after critical operations in an application. This is obvious and part of programming best practice, but it is often missed. Logs at the critical points of an application are invaluable for hard-to-reproduce issues.
  3. The solution to a problem is not always in the code where the issue shows up. The cause could be a missed initialization, a wrong parameter, or something similar. This principle is often overlooked when a developer tries to fix a problem by hacking on the code where it appears.