Runaway batch encoding process

ndunand · November 6, 2017, 1:30pm

Hi all,

Some media files appear to block the encoding queue, probably because they are corrupted or unsupported. The result is a runaway ffmpeg process that never ends, thus blocking the encoding queue. This is reproducible when using the same “corrupted” source file.

The current setup is that the Batch server tries three different encoders in succession (ffmpeg, ffmpeg-aux, and mencoder). It tries the whole succession three times before giving up. The trouble is that at some point one of the attempts just hangs and never ends, thus blocking the processing queue on this server.
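To make the workflow above concrete, here is a minimal sketch of such a fallback chain with a per-attempt timeout, so that a hung encoder counts as a failed attempt instead of blocking the queue. This is not the actual Kaltura batch code: `try_encoders`, `build_command`, the `run` parameter and the timeout value are hypothetical names and choices made for illustration only.

```python
import subprocess

# Encoder order as described in the post; everything else is illustrative.
ENCODERS = ["ffmpeg", "ffmpeg-aux", "mencoder"]


def try_encoders(build_command, rounds=3, timeout_s=3600, run=subprocess.run):
    """Return the name of the first encoder that exits with code 0, else None.

    build_command(encoder) must return the argument list to execute.
    run is injectable so the logic can be exercised without real encoders.
    """
    for _ in range(rounds):
        for encoder in ENCODERS:
            try:
                result = run(build_command(encoder), timeout=timeout_s)
            except subprocess.TimeoutExpired:
                # subprocess.run kills the child before raising, so a hung
                # encoder is treated as one more failed attempt.
                continue
            except OSError:
                # Binary missing or not executable: try the next encoder.
                continue
            if result.returncode == 0:
                return encoder
    return None
```

With a bound like this, the worst case for one job is roughly rounds × 3 × timeout_s, after which the job can be marked as failed.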

The problem is, once all batch servers are in a deadlock, new jobs don’t go through anymore.

My question is: are the runaway processes automatically killed somehow after some time? Or do I have to set up some monitoring to address this?
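For reference, the kind of monitoring I have in mind could look like the sketch below. It only reports encoder processes that have been running longer than a threshold and leaves the decision to kill them to the operator. It assumes a Linux `ps` that supports the `etimes` column (elapsed seconds); the function names and the 6-hour threshold are my own assumptions, not anything provided by Kaltura.

```python
import subprocess


def find_stale(ps_output, names=("ffmpeg", "ffmpeg-aux", "mencoder"),
               max_age_s=6 * 3600):
    """Return (pid, elapsed_seconds, command) for long-running encoders.

    ps_output is the text printed by `ps -eo pid,etimes,comm`.
    """
    stale = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header row and anything that is not "pid seconds name".
        if len(parts) != 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        pid, elapsed, comm = int(parts[0]), int(parts[1]), parts[2].strip()
        if comm in names and elapsed > max_age_s:
            stale.append((pid, elapsed, comm))
    return stale


def list_processes():
    """Run ps and return its output (assumes `etimes` is supported)."""
    completed = subprocess.run(["ps", "-eo", "pid,etimes,comm"],
                               capture_output=True, text=True, check=True)
    return completed.stdout
```

Running `find_stale(list_processes())` from cron would give a list of candidates to inspect before killing anything.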

jess · November 6, 2017, 2:48pm

Hi @ndunand,

When you say “The result is a runaway ffmpeg process that never ends, thus blocking the encoding queue”, can you please provide the log for this job? As you said, we have two fallback transcoding mechanisms in case the lead ffmpeg binary fails to perform the transcoding: ffmpeg-aux, which is just an older version of ffmpeg, and mencoder. If ffmpeg fails, the other two will usually fail as well, but in some cases it’s worth a try, which is why it is done.

At any rate, once all three have tried and failed to transcode all the pre-defined transcoding flavours in the set, the job should be marked as failed and no further attempts should be made. We need to understand why this is not happening in your case.

In regards to aborting batch jobs, see:

ndunand · November 6, 2017, 5:10pm

Hi @jess

Thanks for your reply.

Here’s what I have.

The details of task ID show no log file (“Log File Sync Local Path”) but there is a “convert_0_5jv4dccd_39551.log” file in /opt/kaltura/tmp/convert . This file seems to belong to this job (media ID is “0_5jv4dccd” indeed) but has not been modified for more than 6 hours now. It shows as expected (from the above workflow picture) a running log for ffmpeg-aux.
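A log that stops being written while its job is still listed as running is a usable warning sign. As a rough sketch, something like the following could list convert logs that have not been modified for a while. The `convert_*.log` naming is taken from the file above; the function name and the one-hour threshold are hypothetical. A stale log can also belong to a job that simply finished, so this should be combined with a check of the running processes.

```python
import os
import time


def stale_logs(directory, max_idle_s=3600, now=None):
    """Return sorted (filename, idle_seconds) for convert logs idle too long."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(directory):
        is_convert_log = (entry.name.startswith("convert_")
                          and entry.name.endswith(".log"))
        if entry.is_file() and is_convert_log:
            # Idle time is measured from the last modification of the file.
            idle = now - entry.stat().st_mtime
            if idle > max_idle_s:
                stale.append((entry.name, idle))
    return sorted(stale)
```

For the setup described here, the call would be `stale_logs("/opt/kaltura/tmp/convert")`.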

ndunand · November 6, 2017, 5:11pm

I’ve posted pictures of the beginning and end of the log, as it looks like I can’t post the log itself here.

ndunand · November 6, 2017, 5:11pm

…and the end of the log

The corresponding process on the server has been using 100% on one core for the last 6 hours or so.

ndunand · November 8, 2017, 8:02am

Hi @jess,

Sorry I had to post logs in image format. See my previous posts for all details.

jess · November 9, 2017, 7:19pm

Hi @ndunand,

So, if I understand correctly, the ffmpeg process never terminates and thus the job is never marked as failed?
If that’s the case, I recommend you mark these jobs as aborted by updating the batch_job_sep DB records as I indicated here: https://forum.kaltura.org/t/how-to-cleanup-in-progress-tasks-page/7434/3 and then manually kill all these PHP CLI workers and the ffmpeg procs they spawned. You can send me the full logs by email to jess.portnoy kaltura.com and I promise to review them to try to understand what’s causing this behaviour.

It would also be helpful if you could open a trial account on our SaaS [https://corp.kaltura.com/free-trial] and check whether this can be reproduced there. If so, it’ll be easier for us to debug.

ndunand · November 10, 2017, 10:59am

Hi @jess,

Yes, your understanding is correct.

I’ll follow your recommendation next time.

Thanks for the offer! I’m sending the logs to you via email. I also created a trial account on your SaaS but I couldn’t reproduce the problem there, i.e. the faulty video file was processed: the resulting video, even though obviously corrupted, transcoded and played fine.