Tesseract OCR: Recognize complete dictionary words only


I'm using the Tesseract OCR plugin for PhoneGap: https://github.com/jcesarmobile/PhonegapOCRPlugin/i

I'm trying to configure Tesseract to recognize complete dictionary words only. That is: no special characters, no suffixes or prefixes, etc.

The tessdata folder in this project doesn't contain any configs, so I thought I'd set them at init time. Right now I'm setting them by modifying claseAuxiliar.mm, but I haven't noticed any difference in the output. Either the configs themselves are wrong or I'm setting them the wrong way. Below are my configs and how I'm currently setting them:

    // init the tesseract engine.
    tesseract = new tesseract::TessBaseAPI();
    tesseract->Init([dataPath cStringUsingEncoding:NSUTF8StringEncoding], "eng");
    if (!tesseract->SetVariable("segment_penalty_dict_nonword", "10"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("segment_penalty_garbage", "10"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("stopper_nondict_certainty_base", "-100"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("language_model_penalty_non_dict_word", "1"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("language_model_penalty_non_freq_dict_word", "1"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("GARBAGE_STRING", "5"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("NON_WERD", "5"))
        printf("Setting variable failed!!!\n");

1 Answer

You may want to try suppressing the system dictionary and loading a custom dictionary in its place. The config-file section of the Tesseract manual page covers this:

https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc
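
As a rough sketch of what that could look like (not tested against this plugin), a config file along these lines turns off the built-in word lists and points Tesseract at a user word list. The file name `dict_only` and its location under tessdata/configs are assumptions for illustration, as is a plain-text word list saved as tessdata/eng.user-words, one word per line:

    load_system_dawg     F
    load_freq_dawg       F
    user_words_suffix    user-words

As far as I know, the two `load_*_dawg` parameters are only read during initialization, so calling SetVariable on them after Init() will probably have no effect; the config name would need to be handed to Init() instead. Also keep in mind that the dictionary biases recognition toward listed words rather than strictly forbidding everything else, so some non-dictionary output may still get through.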