Cerebro Seco

Make your computing life easier without sacrificing your principles!

Recover from a -10810 error without a reboot

Difficulty: fairly difficult. Involves working in Terminal and requires a good grasp of abstract concepts such as the Mac OS X process table, child processes, and yes, ZOMBIES!

Background

I ran into this error, which seems common among Mac OS X tinkerers. Unfortunately it is non-specific and doesn't pinpoint a precise issue; in my case, it just meant that the process table was full. Of course, the simplest way to get rid of it is to restart the machine, but I hate doing so without properly understanding the underlying issue.

But first, a bit of history. There are holes in this post and it may not be an accurate account of exactly what happened. I formatted an external USB drive as exFAT using OS X Disk Utility, to be able to share very large files between Windows and the Mac. In a Windows 7 virtual machine, I mounted this disk through VirtualBox 4.3.8's standard tools, then launched a Windows application. I don't remember which one exactly, just that it wasn't detected as a virus (or did I disable AVG? Can't remember), but within a few seconds the disk got unmounted and afterwards would only show up as unformatted, in Windows as well as on the Mac. Common operations such as disk verification with the Windows or Mac disk checking tools yielded no positive result. Even DriveGenius wouldn't detect any bad sector, so at least I could rule that out.

Failed recovery attempt on the corrupted hard drive

I knew that even the fastest application couldn't possibly erase a USB 2.0 hard drive in a few seconds, especially a slower ATA one. Even erasing the partition table in such a short time seemed rather optimistic, so my guess was that some sort of partition "flag" got screwed up by this Windows application. Purists will scowl here, but I don't work in any computer-related trade, so this may not be the correct word. In the end I deduced that resetting this "flag" would be a short trip.

I was wrong. First I used a reputable Windows data recovery program, which readily listed all the files on the drive and allowed me to recover some, but this is not what I wanted, partly because I don't currently have the disk space to hold 500GB worth of (mostly replaceable) data. I then remembered that I had used PhotoRec in the distant past to retrieve pics from an erased SD card, and that the same developer also released TestDisk:

TestDisk is powerful free data recovery software! It was primarily designed to help recover lost partitions and/or make non-booting disks bootable again when these symptoms are caused by faulty software, certain types of viruses or human error (such as accidentally deleting a Partition Table). Partition table recovery using TestDisk is really easy.

(Excerpt from TestDisk's homepage)

This seemed to be exactly what I was looking for. So I downloaded version 6.14 for Mac OS X, the current version as of writing this post. I was glad to see the author had compiled it for all three major platforms, and just went with the native OS X version. I turned on the hard drive, connected it, launched TestDisk's executable, then reloaded it as root using the sudo option at the bottom of its interface. Note that the screenshots show version 7.0-WIP, as the developer directed me to try it.

Capture_d_e_cran_2014-03-15_a__12.31.48.png

I was able to see the disk in the choice list, and proceeded with the scan of it. It correctly detected an EFI GPT partition map.

Capture_d_e_cran_2014-03-15_a__12.33.17.png

The next step was to analyse the drive.

Capture_d_e_cran_2014-03-15_a__12.33.21.png

At first, it seemed to be able to find the lost partition, and even appeared to find some other lost Apple partitions that couldn't be recovered. Something serious had happened there, with so many overlapping partitions. I don't remember this drive ever being used that extensively.

Capture_d_e_cran_2014-03-15_a__12.33.25.png

Capture_d_e_cran_2014-03-15_a__12.36.58.png

Then I selected the partition I wanted to recover and chose a deep search.

Capture_d_e_cran_2014-03-15_a__12.37.09.png

Capture_d_e_cran_2014-03-15_a__12.37.24.png

And it output an even greater number of errors, which I found no reference to on the Internet.

Capture_d_e_cran_2014-03-15_a__12.43.03.png

And soon after, I received this error message, also seemingly unknown on the Internet in relation to TestDisk:

Capture_d_e_cran_2014-03-15_a__12.43.56.png

Seemingly innocuous, right? Wrong. From the Activity Monitor, I could see syslogd, the daemon responsible for logging what happens in the system, was essentially competing with TestDisk for CPU power, hogging all the i7 juice while it ran. In the process, it generated more than 15GB of logs, pure text! I had to kill the process using a [ctrl]+[C] in the Terminal.
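For anyone wanting to check whether logs are ballooning like this, here is a minimal sketch that lists the biggest files in a directory. It assumes the logs ended up under /var/log (where system.log lives on OS X), which I didn't verify at the time; the helper name largest_files is made up for this example.

```shell
# largest_files DIR [N]: print the N biggest regular files under DIR
# (size in KB, then path). Hypothetical helper name; find, du, sort and
# head are standard tools.
largest_files() {
    dir="$1"
    count="${2:-5}"
    find "$dir" -type f -exec du -k {} + 2>/dev/null | sort -nr | head -n "$count"
}

# Assumption: the runaway logs are somewhere under /var/log.
# Reading every file in there may require sudo.
largest_files /var/log 5
```

If one file dwarfs all the others, that is probably the one syslogd is busy filling.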

Needless to say, I got in touch with Christophe Grenier, the developer, and wrote that I couldn't possibly attach this log to an email, even in compressed form, as it was still more than 500MB. A lot of read errors could lead to such a log, he wrote. So, with little background but plenty of goodwill, I tested the same TestDisk in Kali Linux, on the same drive with the same cables. No similar power-hogging behaviour was found.

After-effects of the failed HDD scan

In fact, the most annoying part wasn't the CPU overload. It was the process table gradually filling up, which triggered a -10810 error whenever I attempted to launch any application. As only Activity Monitor would open, I could see there that the number of processes was above 200. While the command

$ ulimit -a

showed a practical limit of 266 user processes (at least on non-server versions of OS X), I could readily see that some command-line programs, most likely automated system tasks, never actually quit and clogged up the process table, rendering the machine unresponsive despite some free RAM and a reasonable amount of swap (less than 2GB).
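To compare the limit with what is actually running, something along these lines can help. This is only a sketch: count_user_procs is a made-up helper name, and the sysctl keys kern.maxproc and kern.maxprocperuid are the OS X ones, so that part is skipped on other systems.

```shell
# Show the per-user process limit of the current shell.
# grep is used rather than a specific ulimit flag, because the flag
# letter for "processes" differs between shells.
ulimit -a | grep -i 'process'

# count_user_procs [USER]: number of processes owned by USER
# (default: whoever runs this). Hypothetical helper name.
count_user_procs() {
    user="${1:-$(id -un)}"
    ps -U "$user" -o pid= 2>/dev/null | wc -l | tr -d ' '
}

echo "processes owned by me: $(count_user_procs)"

# On OS X, the kernel-wide and per-user ceilings are also visible via sysctl.
if [ "$(uname -s)" = "Darwin" ]; then
    sysctl kern.maxproc kern.maxprocperuid
fi
```

When the count creeps close to the limit, new applications will start failing to launch.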

Hypotheses

According to this page about the Linux cousin, the default maximum PID there is 32767. The absolute PIDs I saw were much higher than that, but the page also states that PIDs are routinely assigned by skipping integers, to avoid reusing the same ones too soon.

Meanwhile, the hypothesis evolved as I kept trying to fire up a Console window each time I killed a process. I finally succeeded after a long while, and the system.log clearly showed many entries of

proc: table is full

So with only about a hundred or so processes reported by the Activity Monitor, I inferred the cause to be zombie processes since they don't appear there. Or, a runaway process. Or, a coding error on Christophe Grenier's part. Having yet to receive an answer from him, I can't exclude the possibility.
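For reference, a zombie is a process that has already exited but whose parent hasn't collected its exit status yet, so it still holds a slot in the process table. The zombie hypothesis can be checked from a Terminal with ps, which flags zombies with a state starting with Z. A minimal sketch, with made-up function names:

```shell
# filter_zombies: keep only the rows whose first column (the process
# state) starts with Z. Reads "STAT PID PPID COMMAND" rows on stdin.
filter_zombies() {
    awk '$1 ~ /^Z/'
}

# list_zombies: state, PID, parent PID and command of every zombie.
list_zombies() {
    ps -A -o stat= -o pid= -o ppid= -o comm= 2>/dev/null | filter_zombies
}

list_zombies
echo "zombies: $(list_zombies | wc -l | tr -d ' ')"
```

The PPID column is the interesting one: a zombie can't be killed directly, it only goes away once its parent collects it or exits.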

I ruled out the other sources of this error since no external HDD or server was active at the time this bug appeared (except for the recovered one of course).

Reducing the number of processes

First, I sorted the processes by PID. This gave a rough idea of the order in which they had started. kernel_task always launches first with PID 0, then launchd with PID 1, while most user processes have PIDs above 200.

Since the only workable application I had for reducing the number of running processes was Activity Monitor, I started double-clicking each process that hadn't quit to find out its parent process. In the end, they all came from launchd, no surprise here.

However, you can't simply cut superfluous processes by starting with launchd. This makes no sense, and even if the system allowed it, you would likely crash it immediately.

The key here is to sort processes by PID, then hierarchically. As you carefully kill the parent processes of the redundant ones (the ones likely to have the highest PIDs), entering the admin password every time (tedious!), the process and thread counts reported in the CPU tab of Activity Monitor will decrease, on average. On average, because when you kill a process, another may spawn in reaction, then shut down by itself. Eventually, without crashing the system, I reached the lowest-PID process that had started them all, and immediately noticed that the queued applications opened right away and all the child processes finally died. Unfortunately, I can't remember which one it was, as I crashed the system before writing this post, seemingly because of an unrelated bug. Other very precise instructions are available here.
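If a Terminal window happens to be open already, the same hunt for the parent can be sketched on the command line. This is not what I did at the time, just an assumption of how it could be done; children_per_parent is a made-up name, and when the table is completely full even ps may fail to start.

```shell
# children_per_parent: read "PID PPID COMMAND" rows on stdin and print
# "number-of-children parent-PID", busiest parent first.
children_per_parent() {
    awk '{ n[$2]++ } END { for (p in n) print n[p], p }' | sort -k1,1nr -k2,2n
}

# Show the five parents with the most children.
ps -A -o pid= -o ppid= -o comm= 2>/dev/null | children_per_parent | head -n 5
```

launchd (PID 1) will normally top the list, so the suspect is the next parent with an unusually high count. Running ps -p with that PID shows what it is before deciding whether to kill it.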

Of course, there are other resources for tackling this age-old -10810 error:

From How to fix the "The application Finder can't be opened (-10810)" error:

  • Open up the Terminal application (use Spotlight if you cannot find it)
  • Type: /System/Library/CoreServices/Finder.app/Contents/MacOS/Finder &
  • Press enter and the Finder should launch. You're done! You can close the Terminal

The application Finder.app can't be opened.

Top 10 DTrace scripts for Mac OS X