SALIX
Semi-Automatic Label Information eXtraction System.

Most recent version is 0.138 (6/28/2012):

All necessary files
Notes on Version 0.138:
Getting Started
To get a quick overview of how the program works, unzip the file, being sure to preserve the directory structure ("Users" has to be a subdirectory under SALIX).
Run SALIX.exe. Separately, open the included file "Sample OCR Results.txt". Copy a sample label from the file, click "Paste from Clipboard", and then click "Parse". Once you have fixed any errors and are satisfied that everything has parsed correctly, press "Export Darwin CSV" to add the results to the output file.
Features:
New in Version 0.134.x.x. Darwin Core CSV format (comma-separated values file) has been added and is the primary format used for our current work. This is the recommended output format. Word combinations were added to WordStats. WordStats keeps a record of words that have been used in Locality, Description and Habitat to improve parsing of these fields. This version also uses combinations of words for further improvement. There have been multiple bug fixes and optimizations. Note that the integrated Help file may be broken in some versions of Windows. Please let me know if it does not work correctly for you.
New in Version 0.128.x.x. Integrated Help file. Selecting Help from the main menu will open the new help file. Several pages have context sensitive help, with either a Help menu at the top left, or a question mark at the top right. There is also improved support for the custom fields.
New in Version 0.127.x.x. A few new features have been added, besides bug fixes and under-the-hood parsing improvements. Most of these features are under the "Tools" menu. You can edit the countries file, adding states and counties. You can add min and max latitude and longitude for each country or region, which the system will check against reported coordinates. And there is a "Check Configuration" feature that will examine your support files for common errors and non-optimal settings. There is also a new, more complete list of plants. If you are updating from a previous version, be sure to include the new PlantList.txt and CountryList.txt files, as the program won't run correctly without them.
Built-in OCR
NOTE: This has been disabled in the attached version of SALIX. Commercial programs work so much better that we have stopped supporting this feature.
The program incorporates the open-source Tesseract OCR engine. The built-in OCR is convenient and works well with clear labels, such as the included samples, but if it isn't working as well with your own labels you could try a commercial OCR program. We have had good success with ABBYY OCR software.
Parsing
SALIX is described as "Semi-Automatic". The parsing engine does a pretty good job of categorizing the label text, but it is necessary for a person to monitor the results and correct errors and omissions. Please see the section below on our experience for a discussion of how the user might interact with the program.
Field Definitions
The user can set up definitions for each of the fields. Definitions can consist of words frequently found in the field, words that the field may start with, whether the field must be only numeric, and which range of lines in the label the field may be found on (min and max). For example, "Collector" found as a start word would certainly indicate that the text following was the collector's name. The definitions included in the downloaded file should work pretty well for you, unless you have some unique labels.
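To make the idea of a field definition more concrete, here is a minimal sketch in Python. It is not SALIX's actual code or file format: the `FieldDefinition` class, the `score_line` function and the scoring weights are all hypothetical, chosen only to show how start words, frequent words, a numeric-only flag and a line range could be combined.

```python
from dataclasses import dataclass


@dataclass
class FieldDefinition:
    """Hypothetical stand-in for one field definition (not the SALIX format)."""
    name: str
    start_words: tuple = ()      # words the field may start with
    frequent_words: tuple = ()   # words frequently found in the field
    numeric_only: bool = False   # field must contain only digits
    min_line: int = 1            # first label line the field may appear on
    max_line: int = 999          # last label line the field may appear on


def score_line(definition, line_text, line_number):
    """Return a rough score for how well one label line fits a definition."""
    if not (definition.min_line <= line_number <= definition.max_line):
        return 0
    words = line_text.lower().replace(":", " ").split()
    if not words:
        return 0
    if definition.numeric_only and not all(w.isdigit() for w in words):
        return 0
    score = 0
    if words[0] in definition.start_words:
        score += 10  # assumed weight: a start word is strong evidence
    score += sum(1 for w in words if w in definition.frequent_words)
    if definition.numeric_only:
        score += 1   # an all-digit line fits a numeric-only field
    return score


# Illustrative definitions; the word lists are made up for this example.
collector = FieldDefinition("Collector", start_words=("collector", "coll."), min_line=2)
number = FieldDefinition("Collection Number", numeric_only=True)
```

With these definitions, `score_line(collector, "Collector: J. Smith", 5)` scores well because of the start word, while the same text on line 1 scores zero because it falls outside the allowed line range.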
Auto Learning
SALIX learns from labels as you use it, so parsing improves over time. Learning applies to Habitat, Description, Locality, and to names in the Collector, Other Collector and Determiner fields. Learning only occurs when you "Export to file".
We have added multiple, selectable configuration files, so that as you process a set of labels from a given herbarium, for example, or in a different language, you can swap out the files and benefit from what was learned on similar labels.
Note specifically that the included Learning file has been trained on a large number of labels that we have processed at Arizona State University. You can reset it by going into Tools, Preferences, Clear Stats if you want to start over on some new labels.
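As a rough illustration of how learning from exported labels might work, the sketch below counts single words and adjacent word pairs per field, then picks the field whose learned vocabulary best matches a new piece of text. This is only an assumption about the general approach; the `WordStats` class here, its methods and its scoring are hypothetical and are not taken from the SALIX source.

```python
from collections import Counter, defaultdict


class WordStats:
    """Hypothetical sketch: count words and adjacent word pairs per field."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    @staticmethod
    def _tokens(text):
        # Lower-case the words and strip common punctuation.
        words = [w.strip(".,;:()").lower() for w in text.split()]
        words = [w for w in words if w]
        # Adjacent pairs stand in for the "word combinations" idea.
        pairs = [a + " " + b for a, b in zip(words, words[1:])]
        return words + pairs

    def learn(self, field_name, text):
        """Record a field value; intended to be called only on export."""
        self.counts[field_name].update(self._tokens(text))

    def best_field(self, text):
        """Return the best-matching learned field name, or None if nothing matches."""
        tokens = self._tokens(text)
        scores = {name: sum(seen[t] for t in tokens)
                  for name, seen in self.counts.items()}
        if not scores or max(scores.values()) == 0:
            return None
        return max(scores, key=scores.get)


stats = WordStats()
stats.learn("Habitat", "Rocky slopes in desert scrub")
stats.learn("Locality", "5 miles north of Tempe")
guess = stats.best_field("rocky slopes near wash")
```

Clearing the learned statistics, as described above, would correspond to starting again with an empty `WordStats()` in this sketch.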
Auto Correction
The user can define words or phrases that will then be automatically converted to other text.
For example, if you wish to always convert "Compositae" to "Asteraceae", you can have the program automatically do the conversion either during OCR, or when pasting into the Label Window prior to parsing. This can also be useful if you have a frequent OCR mis-identification problem (e.g. "M" is sometimes OCR'd as "IVI").
To train an auto correction, highlight a word or phrase in the Label Window, right click, and select "AutoCorrect".
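The effect of auto correction can be sketched as a simple lookup table applied to the label text. The function below is a hypothetical illustration, not how SALIX stores or applies its corrections; in particular, the case-sensitive substring matching is an assumption made so that an OCR error such as "IVI" is fixed even in the middle of a word.

```python
import re


def apply_autocorrect(text, corrections):
    """Replace every occurrence of each key in `corrections` with its value."""
    if not corrections:
        return text
    # Try longer entries first so a long phrase wins over a shorter one it contains.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # A single pass avoids re-correcting text that was just substituted.
    return pattern.sub(lambda match: corrections[match.group(0)], text)


# Illustrative table using the two examples mentioned above.
corrections = {"Compositae": "Asteraceae", "IVI": "M"}
fixed = apply_autocorrect("Compositae, IVIaricopa Co.", corrections)
```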
Data Checking
SALIX checks the data for common errors as it is being exported. Several fields should be numeric only, such as collection number, latitude/longitude, and elevation. The date should be in a standard format, and the names of collectors and determiner should not contain non-alphabetic characters. Latitude degrees should not exceed 90, longitude degrees should not exceed 180, and minutes and seconds should be less than 60. If rough latitude/longitude boundaries for the countries are set in the external CountriesLongLat.txt file, SALIX will check that the coordinates fall within range.
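The coordinate checks described here can be sketched in a few lines of Python. The function names and the bounding-box tuple layout below are hypothetical, and the sample bounds are made-up rough numbers for illustration; they are not read from CountriesLongLat.txt, whose format is not described here.

```python
def check_coordinate(degrees, minutes, seconds, is_latitude):
    """Return a list of problems found in one degrees/minutes/seconds value."""
    problems = []
    limit = 90 if is_latitude else 180
    if not 0 <= degrees <= limit:
        problems.append(f"degrees must be between 0 and {limit}")
    if not 0 <= minutes < 60:
        problems.append("minutes must be at least 0 and less than 60")
    if not 0 <= seconds < 60:
        problems.append("seconds must be at least 0 and less than 60")
    return problems


def to_decimal(degrees, minutes, seconds, direction):
    """Convert degrees/minutes/seconds plus N/S/E/W to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if direction.upper() in ("S", "W") else value


def within_bounds(lat, lon, bounds):
    """bounds is (min_lat, max_lat, min_lon, max_lon) in decimal degrees."""
    min_lat, max_lat, min_lon, max_lon = bounds
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


# Made-up rough box for illustration only.
sample_bounds = (31.0, 37.5, -115.0, -109.0)
lat = to_decimal(33, 30, 0, "N")
lon = to_decimal(111, 30, 0, "W")
inside = within_bounds(lat, lon, sample_bounds)
```

A longitude accidentally recorded as east instead of west would fall outside the box, which is the kind of error a boundary check is meant to catch.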
Multiple Users
There can be separate configuration files for different users, switched from within the program. This feature can also be used for labels with different languages or labels created by different collectors, to optimise the parameters and the auto-learning.
Help File
The help file has recently been brought up to date. Several forms have context-sensitive help.
Log File
SALIX generates a log file (Log.txt) that is deleted and recreated empty each time the program runs. This log file is primarily useful for the programmer to debug and develop new routines. Users are encouraged to send the log file with a description of the errors or problems seen. There is a feature on the menu (Tools, Send Log File) that can automatically send the log file to the programmers, though we have found that it doesn't work with some firewalls. If you are unable to send the log file using this feature, try adding an exception to your firewall. In particular, make sure you are not blocking port 587 for Salix.exe. If necessary, the Log.txt file can be emailed directly to Daryl Lafferty (email not posted here).
Other Features
There are many other features and options. Look through the menus, and in most cases the help file will describe the use and function of each option. For the newer options, try mousing over the text for a description or right-clicking. If you can't determine the function of the option, feel free to email me to ask.
Our Experience and Process:
At ASU we have about 5 students operating SALIX. They all agree that it makes the task of converting labels to the database less tedious, faster and more accurate.
Throughput
The best users often exceed 15 labels per hour, including barcoding and photographing the labels, performing OCR, parsing the results in SALIX, and uploading the final .tsv (Tab Separated Values) file to the database. When just running OCR, SALIX and upload on already-photographed samples, they can exceed 35-40 labels per hour.
Barcode and Photograph
The usual process is that the student selects about 50 specimen sheets from the cabinet. They next affix a barcode label to each sheet, in the corner as near the label as possible.

Each sheet is then photographed twice, in a single home-made fixture that positions both cameras. One camera takes a picture of the full sheet, while the other captures just the label and barcode. With high enough resolution cameras, we think the second image of the label may not be necessary, but best OCR results are obtained with more pixels.
OCR
The student then sits down at the computer with the digital images of the labels. One label is selected and OCR is performed (usually using ABBYY). The user might adjust the selected region and re-run as necessary until the correct text is captured.
The built-in Tesseract OCR performs well on clear labels, but the students' experience has been that it is often easier and faster to standardize their process using a commercial OCR program (in our case ABBYY Version 9). Many labels will OCR just fine with Tesseract, but there are enough labels that would have to be repeated with ABBYY that they find it faster to use ABBYY first. However, if you have a set of clear labels it would be worthwhile to try Tesseract first. Using the built-in Tesseract integrates OCR and parsing into one step and is significantly faster.
Transfer to SALIX
This is typically done on a computer with two monitors, with the OCR program on one monitor and SALIX running on the other. The text from ABBYY is copied to the clipboard, and then the "Paste from Clipboard" button is pressed in SALIX. This not only pastes the text into the Label Window, but also applies any Auto Correction changes to the text. With two monitors, the results in SALIX on one monitor can be compared with the original label image on the other. You could accomplish the same result by having the actual specimen sheet next to you as you work.
Format
Before parsing, users often find it helpful to touch up the text a little. Besides correcting OCR errors, the largest benefit comes from adjusting the line breaks in the label, e.g. putting the latitude/longitude on a separate line, or splitting habitat and description onto separate lines when the transition occurs mid-paragraph. Although SALIX does a pretty good job of working across and within lines, it is more accurate when the user makes those simple changes themselves. This typically takes just a few seconds. Note also that there is a setting in "Tools, Preferences" to inform the system whether you are allowing multiple fields to be on a single line. Our students always set this to tell SALIX "Don't split lines", leaving the students to do that touch-up themselves.
Parse
Then the "Parse Only" button is pressed, and SALIX will examine the label and categorize the text into the various fields on the sheet.
Examine for Errors
The next step is to examine the results. Did the right name get put into Collector? Did extra information get copied after the "Other Collectors" names? Is the Collection Number correct? Until the Automatic Learning takes place, the most likely places for errors are in Habitat, Locality and Description. Until SALIX learns what your labels look like, these fields are difficult to categorize. After a few labels, however, these fields become pretty accurate.
Correct Errors
Any changes can be made by either typing directly into the field's edit window, or highlighting the text in the Label Window and pressing the button next to the right field window. (An advantage of doing it this way is that the Log file will record the step and give me useful information about ways to improve the program.)
Export
Once the fields are all categorized correctly, the user presses the "Export to TSV" button, and the data is added to the current data file. During this step, SALIX will check for various typical errors, such as text in a numeric-only field, numbers in a name field, an incorrectly formatted date, a Latitude or Longitude missing its N/S or E/W direction, and several other items.
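The export step, checking a record and then appending it to the current data file, might look something like the sketch below. Everything in it is an assumption made for illustration: the column names in `FIELDS` are a small made-up subset rather than the columns SALIX actually writes, and `find_problems` covers only two of the checks mentioned above.

```python
import csv
import os

# Illustrative subset of columns; not the actual SALIX output layout.
FIELDS = ["Collector", "Collection Number", "Date", "Locality", "Habitat"]


def find_problems(record):
    """Return a list of typical errors found in one parsed label record."""
    problems = []
    if not record.get("Collection Number", "").isdigit():
        problems.append("Collection Number must be numeric")
    if any(ch.isdigit() for ch in record.get("Collector", "")):
        problems.append("Collector contains digits")
    return problems


def export_record(path, record):
    """Append one record to a tab-separated file, or return the problems found."""
    problems = find_problems(record)
    if problems:
        return problems  # nothing is written until the user fixes these
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                                extrasaction="ignore")
        if is_new:
            writer.writeheader()  # header row only for a fresh data file
        writer.writerow(record)
    return []
```

Returning the list of problems instead of writing the row mirrors the idea that the user corrects the fields and then exports again.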
Upload to Database
Finally, after several labels (20 or 30, typically) the data file is uploaded to the database.
Batch Operation
We are doing some testing of batch OCR. ABBYY Version 10 can be given a directory of images, and will produce a single document containing the text from all the images. We can set this up with 100 images, for example, get it started, then go work on other things. When the OCR operation has completed, the technician can copy the text from the text file into SALIX one label at a time for processing. We are still working out the kinks in the procedure, but it looks very promising.