Building a tesseract 5 environment and Docker image based on CentOS 7.3

Because tesseract had to be installed offline, I decided to package it as a Docker image, built from a container with docker commit.

The general idea is to pull a base CentOS image, start it, enter the container and install tesseract, and then commit the container as an image. The image is then saved as a tar package, which allows offline installation. (This is also a general approach to building your own images.)

In fact, the steps below are the same as installing tesseract on an ordinary Linux machine; there you simply skip downloading the base CentOS image and committing the container as an image.
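
The workflow described above can be sketched as two small shell functions. This is only an outline under assumptions: the container name tess-build and the function names start_build_container and export_image are hypothetical, and the installation itself (steps 2 to 10 below) still has to happen inside the container between the two phases.

```shell
#!/bin/sh
# Phase 1: pull the base image and start a long-lived build container.
# "tess-build" is a hypothetical container name chosen for this sketch.
start_build_container() {
    docker pull "$1" &&
        docker run -d -i -t --name tess-build "$1" /bin/bash
}

# Phase 2: once tesseract is installed inside the container, commit the
# container as an image and export that image as a tar archive.
export_image() {
    docker commit tess-build "$1" &&
        docker save -o "$1.tar" "$1"
}

# Dispatch on the first script argument; do nothing when called without one.
case "${1:-}" in
    start)  start_build_container "$2" ;;
    export) export_image "$2" ;;
esac
```

Saved as a script, it would be called once with start and the base image name, and once with export and the new image name after the installation is finished.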

0. Prepare files

(1) tesseract source files:

GitHub download address: https://github.com/tesseract-ocr/tesseract

(2) leptonica-1.79.0.tar.gz; tesseract depends on this library

(3) Language packs

Only chi_sim.traineddata and eng.traineddata are actually needed.

 

You can download them from GitHub, which hosts two versions: a fast version and a best version.

fast: https://github.com/tesseract-ocr/tessdata_fast

best: https://github.com/tesseract-ocr/tessdata_best

The difference between the two is that the fast version's model files are smaller and recognition is faster, while the best version's model files are larger and recognition is slower. Whether recognition accuracy differs remains to be verified.
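
To fetch only the two files that are needed, a download URL can be assembled from the repository variant and the language code. The sketch below is an illustration under assumptions: the helper name tessdata_url is made up, and the raw/main path layout is assumed from GitHub's usual raw-file URLs, so it should be checked against the repositories before use.

```shell
#!/bin/sh
# tessdata_url VARIANT LANG (hypothetical helper)
# VARIANT is "fast" or "best"; LANG is a language code such as chi_sim or eng.
tessdata_url() {
    case "$1" in
        fast|best) ;;
        *) echo "unknown variant: $1" >&2; return 1 ;;
    esac
    # Assumption: the files live on the "main" branch and are served via /raw/.
    printf 'https://github.com/tesseract-ocr/tessdata_%s/raw/main/%s.traineddata\n' "$1" "$2"
}

# Print the URLs of the two language packs used in this walkthrough.
for lang in chi_sim eng; do
    tessdata_url fast "$lang"
done
```

On a machine with internet access, each printed URL can be handed to a downloader of your choice.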

(4) Prepare two images for verification

 

1. Download the CentOS image

docker pull hub.c.163.com/library/centos:latest

 

2. Start the image and enter the container to check the CentOS version:

C:\Users\Administrator>docker run -i -t hub.c.163.com/library/centos:latest /bin/bash
[root@86867025ffc7 /]# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)
[root@86867025ffc7 /]# uname -a
Linux 86867025ffc7 4.14.154-boot2docker #1 SMP Thu Nov 14 19:19:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

 

3. Upload tesseract-master.zip to the container's /opt/tesseract directory (run on the host)

docker cp ./tesseract-master.zip 86867025ffc7:/opt/tesseract

86867025ffc7 is the ID of the CentOS container. The /opt/tesseract directory must already exist in the container.
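
One caveat with docker cp: if the destination path does not exist in the container, a single file is copied to a new file with that name rather than into a directory. A hedged sketch that creates the directory first; the helper name copy_into_container is hypothetical, and it assumes the container is running so that docker exec works.

```shell
#!/bin/sh
# copy_into_container CONTAINER SRC DEST_DIR (hypothetical helper)
# Create DEST_DIR inside the container, then copy SRC into it.
# The trailing slash makes docker cp treat the destination as a directory.
copy_into_container() {
    docker exec "$1" mkdir -p "$3" &&
        docker cp "$2" "$1:$3/"
}
```

For this step it would be called as copy_into_container 86867025ffc7 ./tesseract-master.zip /opt/tesseract.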

 

4. Unzip in container

You need to install unzip first:

[root@86867025ffc7 tesseract]# yum list unzip
Loaded plugins: fastestmirror, ovl
Determining fastest mirrors
 * base: mirrors.cqu.edu.cn
 * extras: mirror.lzu.edu.cn
 * updates: mirrors.njupt.edu.cn
updates/7/x86_64/primary_db                                                         | 1.3 MB  00:00:00
Available Packages
unzip.x86_64                                        6.0-21.el7                                         base
[root@86867025ffc7 tesseract]# yum install unzip

Unzip:

unzip ./tesseract-master.zip

 

5. Install the build environment

(1) Install the compiler toolchain: gcc, gcc-c++, make

yum install gcc gcc-c++ make

 

(2) Install the packages needed to compile tesseract-ocr

yum install autoconf automake libtool libjpeg-devel libpng-devel libtiff-devel
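
Before running autogen.sh it can help to confirm that the build tools are actually on the PATH. A minimal sketch, assuming a POSIX shell; the helper name missing_tools is hypothetical, and the tool list simply mirrors the packages installed above (it cannot check the -devel libraries).

```shell
#!/bin/sh
# missing_tools TOOL... : print every tool that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Command names provided by the packages from (1) and (2).
missing=$(missing_tools gcc g++ make autoconf automake libtool)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing build tools:" $missing
else
    echo "all build tools found"
fi
```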

 

6. Install leptonica

(1) Upload leptonica-1.79.0.tar.gz to the container (run on the host):

docker cp ./leptonica-1.79.0.tar.gz 86867025ffc7:/opt/tesseract

(2) Extract and install it in the container:

tar -zxvf ./leptonica-1.79.0.tar.gz
cd ./leptonica-1.79.0
./autogen.sh
./configure
make && make install

 

(3) Add environment variables (if the vim command is not found, install it with yum install vim)

vim /etc/profile

Append the following at the end of the file:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBLEPT_HEADERSDIR=/usr/local/include
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

 

Run the following command so that the changes take effect immediately:

source /etc/profile
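
One detail worth knowing: if LD_LIBRARY_PATH is empty, $LD_LIBRARY_PATH:/usr/local/lib expands to :/usr/local/lib, and the dynamic loader treats the empty leading entry as the current directory. A hedged sketch of a more defensive variant that uses the standard ${VAR:+...} parameter expansion; the helper name append_path is hypothetical.

```shell
#!/bin/sh
# append_path CURRENT NEW : print CURRENT with NEW appended, without
# producing an empty entry when CURRENT is empty.
append_path() {
    printf '%s\n' "${1:+$1:}$2"
}

LD_LIBRARY_PATH=$(append_path "${LD_LIBRARY_PATH:-}" /usr/local/lib)
export LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```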

 

(4) Run the following command and confirm that lept appears in the output; if it does not, review the steps above

[root@86867025ffc7 leptonica-1.79.0]# pkg-config --list-all | grep lept
lept             leptonica - An open source C library for efficient image processing and image analysis operations

 

7. Install tesseract: enter the tesseract-master directory and run the following commands:

./autogen.sh
./configure
make && make install

 

I got an error during make and make install, so I upgraded GCC. The original GCC version was 4.x; after upgrading to 8, make && make install succeeded:

Step 1: Install the SCL repository:
yum install centos-release-scl scl-utils-build
Step 2: List the packages available from the SCL repository:
yum list all --enablerepo='centos-sclo-rh'
yum list all --enablerepo='centos-sclo-rh' | grep "devtoolset-"
Step 3: Install version 8 of the gcc, gcc-c++, gdb toolchain and start a shell that uses it:
yum install -y devtoolset-8-toolchain
scl enable devtoolset-8 bash

 

Check the gcc version:

[root@86867025ffc7 tesseract-master]# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-x86_64-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)

 

After upgrading gcc, run configure, make, and make install again; otherwise tesseract reports a free(): invalid pointer error at run time. If make and make install complete without errors, you do not need to upgrade gcc.
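
A quick way to tell which compiler a build shell will use is to look at the major version reported by gcc -dumpversion. This is a sketch only: the helper name major_version is hypothetical, and the threshold of 8 is simply the version that worked in this walkthrough, not a documented requirement.

```shell
#!/bin/sh
# major_version VERSION : print the part before the first dot ("8.3.1" -> "8").
major_version() {
    printf '%s\n' "${1%%.*}"
}

if command -v gcc >/dev/null 2>&1; then
    major=$(major_version "$(gcc -dumpversion)")
    if [ "$major" -ge 8 ]; then
        echo "gcc $major: same major version as in this walkthrough, or newer"
    else
        echo "gcc $major: consider enabling devtoolset-8 before running configure"
    fi
else
    echo "gcc not found"
fi
```

Because scl enable devtoolset-8 bash only affects the shell it starts, the check is meant to be run in the same shell as configure and make.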

 

8. Check the tesseract version:

[root@86867025ffc7 tesseract-master]# tesseract -v
tesseract 5.0.0-alpha
 leptonica-1.79.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7
 Found AVX2
 Found AVX
 Found SSE
 Found OpenMP 201511

 

9. Copy the language packs into the container's /usr/local/share/tessdata directory, then list the available languages:

tesseract --list-langs

I got an error when running it:

*** Error in `tesseract': free(): invalid pointer: 0x000000000065bff0 ***

Solution: re-run configure, make, and make install in the tesseract-master directory (after upgrading gcc as described in step 7).

 

List the languages again:

[root@86867025ffc7 tesseract-master]# tesseract --list-langs
List of available languages (2):
chi_sim
eng
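
The same check can be scripted, which is handy when the image is rebuilt. A minimal sketch, assuming the output format shown above (one header line followed by one language code per line); the helper name has_lang is hypothetical.

```shell
#!/bin/sh
# has_lang LANG : read `tesseract --list-langs` output on stdin and succeed
# only if LANG appears as a complete line (fixed-string, whole-line match).
has_lang() {
    grep -Fqx -- "$1"
}

# Sample input copied from the output above. In practice, pipe
# `tesseract --list-langs` into has_lang instead (add 2>&1 if your build
# prints the list to stderr).
sample='List of available languages (2):
chi_sim
eng'

for lang in chi_sim eng; do
    if printf '%s\n' "$sample" | has_lang "$lang"; then
        echo "$lang: installed"
    else
        echo "$lang: missing"
    fi
done
```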

 

10. Verify recognition: upload the prepared images to the container's /opt/ocrtemplate directory

[root@86867025ffc7 ocrtemplate]# tesseract ./zh.png result -l chi_sim -c preserve_interword_spaces=1 --dpi 300 --oem 1
Tesseract Open Source OCR Engine v5.0.0-alpha with Leptonica
[root@86867025ffc7 ocrtemplate]# ls
result.txt  zh.png
[root@86867025ffc7 ocrtemplate]# cat result.txt
//Chinese Server

 

Common variants of the tesseract command:

# Print the result to the console
tesseract ./normal.png stdout -l chi_sim

# Specify only the language. The specified language also requires chi_sim_vert.traineddata
tesseract ./normal.png chi_sim__simhei_result -l chi_sim
tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1
tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1 --dpi 300
tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1 --dpi 300 --oem 1

# Specifying --psm 1 requires osd.traineddata
tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1 --dpi 300 --oem 1 --psm 1
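
When many images need the same options, the variants above can be wrapped in a loop. A sketch under assumptions: the helper name ocr_dir is hypothetical, the images are assumed to be .png files, and each result is written next to its image under the same base name.

```shell
#!/bin/sh
# ocr_dir DIR [LANG] : run tesseract on every .png file in DIR.
# LANG defaults to chi_sim, the language used in this walkthrough.
ocr_dir() {
    dir=$1
    lang=${2:-chi_sim}
    for img in "$dir"/*.png; do
        # Skip the unexpanded pattern when DIR contains no .png files.
        [ -e "$img" ] || continue
        # tesseract appends .txt to the output base name itself.
        tesseract "$img" "${img%.png}" -l "$lang" \
            -c preserve_interword_spaces=1 --dpi 300 --oem 1
    done
}
```

Calling ocr_dir /opt/ocrtemplate would, for example, turn zh.png into zh.txt in the same directory.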

 

11. Turn the container into an image (look up the container ID, then commit it as an image):

Administrator@MicroWin10-1535 MINGW64 ~/Desktop/dockertest
$ docker ps -a
CONTAINER ID        IMAGE                                 COMMAND             CREATED             STATUS              PORTS               NAMES
86867025ffc7        hub.c.163.com/library/centos:latest   "/bin/bash"         18 hours ago        Up 18 hours                             objective_swanson

Administrator@MicroWin10-1535 MINGW64 ~/Desktop/dockertest
$ docker commit 86867025ffc7 zdtesseract
sha256:46153c6cb7da8deb023aacaa46413822ee921a147151a235375e1a7f4ad64fda

Administrator@MicroWin10-1535 MINGW64 ~/Desktop/dockertest
$ docker images | grep zdte
zdtesseract                             latest               46153c6cb7da        About a minute ago   1.64GB

Administrator@MicroWin10-1535 MINGW64 ~/Desktop/dockertest
$ docker history zdtesseract
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
46153c6cb7da        7 minutes ago       /bin/bash                                       1.44GB
328edcd84f1b        2 years ago         /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B
<missing>           2 years ago         /bin/sh -c #(nop)  LABEL name=CentOS Base Im...   0B
<missing>           2 years ago         /bin/sh -c #(nop) ADD file:63492ba809361c51e...   193MB

 

The installation of tesseract in the container is now complete. Installing tesseract on a real Linux machine follows the same steps.

 

The following steps make the Docker image more convenient to use. To run tesseract without entering the container, execute a command like this:

docker run --rm -v /opt/ocrtemplate:/opt/ocrtemplate zdtesseract tesseract /opt/ocrtemplate/normal.png /opt/ocrtemplate/result -l chi_sim --oem 1 --dpi 300

--rm specifies that the container is deleted automatically after it exits; -v specifies the mount between a host directory and a container directory (host directory:container directory). After execution, result.txt is generated in the /opt/ocrtemplate directory inside the container. Because the host and container /opt/ocrtemplate directories are mounted together, the file also appears in the host directory, which gives the final OCR result file.

 

12. Testing

If Docker runs inside a VirtualBox virtual machine on Windows, you need to enter the virtual machine first.

 

docker-machine ssh default  # enter the virtual machine
sudo -i    # switch to the root user

 

Then execute:

root@default:/opt/ocrtemplate# docker run --rm zdtesseract tesseract -v
tesseract 5.0.0-alpha
 leptonica-1.79.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7
 Found AVX2
 Found AVX
 Found SSE
 Found OpenMP 201511
root@default:/opt/ocrtemplate# docker run --rm zdtesseract tesseract --list-langs
List of available languages (2):
chi_sim
eng

 

For convenience, we define an alias for the command above:

root@default:/opt/ocrtemplate# alias tesseract='docker run --rm -v /opt/ocrtemplate:/opt/ocrtemplate zdtesseract tesseract' # define the alias
root@default:/opt/ocrtemplate# alias  # list the defined aliases
alias tesseract='docker run --rm -v /opt/ocrtemplate:/opt/ocrtemplate zdtesseract tesseract'
root@default:/opt/ocrtemplate# tesseract -v  # check the version through the alias
tesseract 5.0.0-alpha
 leptonica-1.79.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7
 Found AVX2
 Found AVX
 Found SSE
 Found OpenMP 201511

Recognize a file as a test:

(1) Print the result to the console:

tesseract /opt/ocrtemplate/normal.png stdout -l chi_sim --oem 1 --dpi 300

(2) Write the result to a file:

tesseract /opt/ocrtemplate/normal.png /opt/ocrtemplate/result -l chi_sim --oem 1 --dpi 300

Explanation: with the alias in place, the command above actually runs:

docker run --rm -v /opt/ocrtemplate:/opt/ocrtemplate zdtesseract tesseract /opt/ocrtemplate/normal.png /opt/ocrtemplate/result -l chi_sim --oem 1 --dpi 300

As in step 11, --rm deletes the container automatically after it exits, and -v mounts the host's /opt/ocrtemplate directory into the container, so the generated result.txt also appears in the host directory.
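
An alias works well in an interactive shell, but bash does not expand aliases in scripts by default, so a shell function can be a more portable wrapper. The sketch below is one possible variant, not the setup used in this post: the function name docker_ocr is hypothetical, the image path is assumed to be absolute (docker -v needs an absolute host path), and the image's own directory is mounted at the same path inside the container.

```shell
#!/bin/sh
# docker_ocr IMAGE_PATH [LANG] : OCR one image through the zdtesseract image.
# IMAGE_PATH must be absolute; its directory is mounted at the same path in
# the container, so the result file lands next to the image on the host.
docker_ocr() {
    img=$1
    lang=${2:-chi_sim}
    dir=$(dirname "$img")
    # "${img%.*}" strips the extension; tesseract appends .txt on its own.
    docker run --rm -v "$dir:$dir" zdtesseract \
        tesseract "$img" "${img%.*}" -l "$lang" --oem 1 --dpi 300
}
```

With this variant, docker_ocr /opt/ocrtemplate/normal.png would write /opt/ocrtemplate/normal.txt rather than result.txt.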

 

13. Save the image as an archive for offline installation

docker save -o zdtesseract.tar zdtesseract

 

The tar package can then be copied to another machine and loaded there offline with docker load.

Of course, the image can also be pushed to an image registry.
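
On the offline machine, the archive is imported with docker load, the counterpart of docker save. A minimal sketch; the helper name load_image is hypothetical, and the existence check is only there to give a clearer message than docker's own error.

```shell
#!/bin/sh
# load_image ARCHIVE : load a saved image archive into the local Docker daemon.
load_image() {
    if [ ! -f "$1" ]; then
        echo "archive not found: $1" >&2
        return 1
    fi
    docker load -i "$1"
}
```

After load_image zdtesseract.tar, the zdtesseract image should show up in docker images and can be used with the docker run commands from step 11.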


Posted on Wed, 03 Jun 2020 21:15:29 -0400 by leeue