2020-05-28, 09:43 PM
(This post was last modified: 2020-05-28, 10:08 PM by pipefan413.)
Hi guys,
This probably isn't a big deal, but I'm wondering about it for the purposes of a very precise sync I'm currently doing (to the level of individual audio samples). I figure if I'm being this accurate I may as well get this bit right. Apologies for the number of calculations here, but it's probably necessary to explain precisely why I'm asking what I'm asking; if you want to cut to the chase and see what the question actually is, I've set it apart below. Explanation follows from here...
I'm taking an audio stream containing 362,602,496 audio samples at 48 kHz, adding 18,018 samples to sync, and combining it with a custom video stream containing 181,393 frames of video. To compare the video length to the audio length very precisely, this number of video frames is equivalent to 181,393 / (24000/1001) x 48000 = 363,148,786 audio samples. This means that the custom video outlasts the synced audio by 363,148,786 - (362,602,496 + 18,018) = 528,272 samples, or rather 528,272 / 48000 = 11.005666... seconds.
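For anyone wanting to double-check the arithmetic above, here's a minimal Python sketch of the frames-to-samples conversion. The names (`frames_to_samples`, `SAMPLE_RATE` and so on) are just illustrative and not from any tool; it assumes a constant 24000/1001 fps video and 48 kHz audio, as described in the post.

```python
from fractions import Fraction

SAMPLE_RATE = 48_000                  # audio samples per second
FRAME_RATE = Fraction(24_000, 1_001)  # video frames per second (23.976...)

def frames_to_samples(frames: int) -> Fraction:
    """Exact number of audio samples spanned by `frames` video frames."""
    return frames / FRAME_RATE * SAMPLE_RATE

video_samples = frames_to_samples(181_393)       # custom video length in samples
synced_audio_samples = 362_602_496 + 18_018      # original audio plus sync offset
gap_samples = video_samples - synced_audio_samples
gap_seconds = gap_samples / SAMPLE_RATE

print(f"video length: {video_samples} samples")
print(f"video outlasts audio by {gap_samples} samples ({float(gap_seconds):.6f} s)")
```

Using `Fraction` keeps everything exact, so there's no floating-point rounding to worry about until the final conversion to seconds for display.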
The audio is from a different video file which contains 181,121 frames (equiv. 181,121 / (24000/1001) x 48000 = 362,604,242 audio samples). As it appeared there, the audio lasts almost as long as the video, such that there is audio playing during almost every video frame although not all the way through to the end of the last frame (it's a few audio frames short). More precisely, the difference is 362,604,242 - 362,602,496 = 1,746 samples, or rather 1,746 / 48000 = 0.036375 seconds (36.375 ms). The duration of one video frame is 1 / (24000/1001) = 0.041708333... seconds = 41.708... ms, so the audio stream runs through the second-to-last video frame and ends about 5.3 ms into the last one, although in practice it's already fallen silent a few seconds before that.
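The same comparison for the source file can be sketched like this, including which video frame the audio actually ends in. Again, the variable names are only illustrative; the sketch relies on the fact that one video frame at 24000/1001 fps spans exactly 2002 samples at 48 kHz.

```python
SAMPLE_RATE = 48_000
# 48000 * 1001 / 24000 is exactly 2002, so integer maths is safe here.
SAMPLES_PER_VIDEO_FRAME = SAMPLE_RATE * 1_001 // 24_000

source_video_samples = 181_121 * SAMPLES_PER_VIDEO_FRAME
source_audio_samples = 362_602_496

# How far short of the end of the video the audio stops.
tail_gap = source_video_samples - source_audio_samples
tail_gap_ms = tail_gap / SAMPLE_RATE * 1000

# Zero-based index of the video frame the audio ends in, and how far into it.
last_frame_index, offset = divmod(source_audio_samples, SAMPLES_PER_VIDEO_FRAME)
offset_ms = offset / SAMPLE_RATE * 1000

print(f"audio ends {tail_gap} samples ({tail_gap_ms:.3f} ms) before the video")
print(f"audio ends {offset} samples ({offset_ms:.3f} ms) into frame {last_frame_index + 1} of 181121")
```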
Here's the question, then:
Is there any benefit to adding silence to the end of an audio bitstream in order to more closely match the duration of the video stream it's muxed to, for the purposes of Blu-ray Disc compliance?
If not, I'll just leave it so that the audio track fades to silence then ends shortly after that, then the video (hopefully) will continue playing for another 11 seconds after that. However, I'm wondering if this might cause some software or hardware players to freak out and possibly stop the video once the audio ends; I know some players will do this if the tracks are the other way around (with the video shorter than the audio) but I'm guessing that the video is always prioritised so it may not matter when it's this way round. If it is an issue though, I'll add silence to the end of the audio as follows...
The gap to fill is 363,148,786 - (362,602,496 + 18,018) = 528,272 samples, but I'm encoding with DTS-HD MA, which contains a DTS core with 512 samples per frame, and the target length of 363,148,786 is not divisible by 512 (363,148,786 / 512 = 709,274.972... audio frames), so I'd round the audio stream down to 709,274 x 512 = 363,148,288 samples. This would make the discrepancy only 363,148,786 - 363,148,288 = 498 audio samples, or in real terms, 498 / 48000 x 1000 = 10.375 ms. Like the 36.375 ms difference between the original audio and video durations on the source I took the audio from, that's less than one video frame, which may not necessarily be coincidental!
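The round-down padding described above could be worked out with something like the following. This is only a sketch: `silence_to_append` is a made-up helper name, and the 512-sample frame size is taken from the post rather than checked against the DTS specification.

```python
SAMPLE_RATE = 48_000
DTS_FRAME = 512  # samples per DTS core frame, as stated in the post

def silence_to_append(audio_samples: int, video_samples: int, frame: int = DTS_FRAME):
    """Return (target_length, padding) so the audio ends on the last whole
    audio frame boundary that does not run past the end of the video."""
    target = (video_samples // frame) * frame
    return target, target - audio_samples

video_samples = 363_148_786
audio_samples = 362_602_496 + 18_018

target, padding = silence_to_append(audio_samples, video_samples)
shortfall = video_samples - target

print(f"pad with {padding} samples of silence to reach {target} samples")
print(f"audio then ends {shortfall} samples ({shortfall / SAMPLE_RATE * 1000:.3f} ms) before the video")
```

Rounding down rather than up keeps the audio from ever outlasting the video, which is the case the post flags as known to upset some players.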
EDIT: Mostly inconsequential but just in case anybody notices and thinks I missed it... I realise that if I just leave the audio as is without adding silence to the end, it's 362,620,514 samples, which isn't divisible by 512 (362,620,514 / 512 = 708,243.19140...). There are two simple solutions to that: either encode it as is and let the DTS encoder round it up to the next whole 512-sample frame, or add samples before encoding to achieve the exact same result. Either way, it'd have 708,244 x 512 = 362,620,928 samples in the audio if I don't add loads more to the end so that it more closely matches the number of video frames.
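For the no-extra-silence case in the EDIT, rounding up to the next whole audio frame is just a ceiling division. A quick sketch follows (`round_up_to_frame` is a hypothetical helper; whether a given DTS encoder pads the final frame in exactly this way is an assumption taken from the post):

```python
def round_up_to_frame(samples: int, frame: int = 512):
    """Return (frame_count, padded_length) after rounding up to whole frames."""
    frame_count = -(-samples // frame)  # ceiling division using integers only
    return frame_count, frame_count * frame

audio_samples = 362_602_496 + 18_018
frame_count, padded = round_up_to_frame(audio_samples)
extra = padded - audio_samples

print(f"{frame_count} frames, {padded} samples ({extra} samples of padding)")
```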