• CocksuckerDynamo@alien.topB

    > It's useful for people who want to know the inference response time.

    No, it's useful for people who want to know the inference response time at batch size 1, which is not something prospective H200 buyers care about. Are you aware that business deployments for interactive use cases such as real-time chat generally use batching? Perhaps you're assuming request batching is only for offline / non-interactive workloads, but that isn't the case.
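
    To illustrate the point, here is a minimal sketch (not the H200's or any serving framework's actual API) of why interactive serving still batches: requests that arrive within a short window are grouped and run as one forward pass, and each caller still gets its own low-latency reply. The model call, batch size limit, window length, and latencies are all assumed values for illustration only.

    ```python
    import queue
    import threading
    import time

    MAX_BATCH = 8     # assumed maximum batch size
    WINDOW_MS = 10    # assumed batching window in milliseconds

    request_q: "queue.Queue[tuple[str, queue.Queue]]" = queue.Queue()

    def fake_forward_pass(prompts: list[str]) -> list[str]:
        # Stand-in for one batched model forward pass. A real model's step time
        # grows sub-linearly with batch size, which is the whole point of batching.
        time.sleep(0.05 + 0.005 * len(prompts))
        return [f"reply to: {p}" for p in prompts]

    def batching_loop() -> None:
        while True:
            first = request_q.get()  # block until at least one request arrives
            batch = [first]
            deadline = time.monotonic() + WINDOW_MS / 1000
            # Collect any other requests that show up within the window.
            while len(batch) < MAX_BATCH:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(request_q.get(timeout=remaining))
                except queue.Empty:
                    break
            prompts = [p for p, _ in batch]
            for (_, reply_q), out in zip(batch, fake_forward_pass(prompts)):
                reply_q.put(out)  # each caller still gets its own reply

    def submit(prompt: str) -> str:
        # What an individual "chat" request does: enqueue and wait for its reply.
        reply_q: queue.Queue = queue.Queue()
        request_q.put((prompt, reply_q))
        return reply_q.get()

    threading.Thread(target=batching_loop, daemon=True).start()

    # Several concurrent users hitting the server at once get served from one batch.
    threads = [threading.Thread(target=lambda i=i: print(submit(f"user {i}"))) for i in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    ```

    Concurrent requests land in the same forward pass, so the throughput a buyer actually sees depends on batched performance, not batch-size-1 response time.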