The GPT-2 model is large (over 7 GB on disk).
In memory it is even larger, and inference can consume significant resources.
For this reason we recommend a minimum instance type of
- On this instance type it can take 40-50 seconds to get a response for a single query.
The model supports GPU instances out of the box, for example using the
- This provides faster inference times, around 10-20 seconds per query.
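Given these per-request latencies and the fact that each instance serves one request at a time, you can roughly size a deployment. A minimal sketch of that arithmetic (the latency figures are the approximate ones quoted above; `instances_needed` is an illustrative helper, not part of any library):

```python
import math

def instances_needed(requests_per_minute: float, seconds_per_request: float) -> int:
    # Each instance handles one request at a time, so its throughput
    # is 60 / seconds_per_request requests per minute.
    per_instance_throughput = 60.0 / seconds_per_request
    return math.ceil(requests_per_minute / per_instance_throughput)

# Rough comparison for a load of 20 requests/minute:
print(instances_needed(20, 45))  # CPU instance at ~45 s/request -> 15
print(instances_needed(20, 15))  # GPU instance at ~15 s/request -> 5
```

This is only a back-of-the-envelope estimate; real capacity planning should account for queueing delay and traffic spikes.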
Note that, because of the nature of this model and its API, it can only process one request at a time, so you will need to scale the deployment according to the number of concurrent users you expect.