Combining information from the visual and auditory senses can greatly enhance intelligibility of natural speech. Integration of audiovisual speech signals is robust even when temporal offsets are present between the component signals. In the present study, we characterized the temporal integration window for speech and nonspeech stimuli with similar spectrotemporal structure to investigate to what extent humans have adapted to the specific characteristics of natural audiovisual speech. We manipulated spectrotemporal structure of the auditory signal, stimulus length, and task context. Results indicate that the temporal integration window is narrower and more asymmetric for speech than for nonspeech signals. When perceiving audiovisual speech, subjects tolerate visual leading asynchronies, but are nevertheless very sensitive to auditory leading asynchronies that are less likely to occur in natural speech. Thus, speech perception may be fine-tuned to the natural statistics of audiovisual speech, where facial movements always occur before acoustic speech articulation.