The Friedberg Genizah Project

The Research Platform

Inventories

Digital Images

The Data Axis

The Software Axis

Achievements in Digital Image Analysis

Inventories

The first task undertaken by “Genazim” was the compilation of a comprehensive, precise and up-to-date computerized inventory of the formal shelfmarks of all the Genizah fragments in all of the Genizah collections around the world, whether large or small, “important” or not, public or private.

To achieve this, and to ensure the comprehensiveness of the process, we had, among other things, to compile, for the first time, an exhaustive list of all the public and private Genizah collections in the world, tracking down even “collections” that contain a single fragment.

As later became evident, this effort not only allowed us to attach, from that moment on, all available and future data on any fragment to its shelfmark, but also prompted Genizah researchers to make use of the precise and accurate shelfmarks (as formally defined by the relevant libraries) appearing in our inventories in their publications, thus encouraging a much needed trend of standardization in that context. Moreover, since fragments may often change shelfmarks, for the reasons detailed above, we made a sustained effort to collect all of the older or alternate shelfmarks of a given collection, to record them and to attach them to the current one, so as to correctly attach data that may have been appended to an older shelfmark to the newer one.
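As an illustration only, attaching older or alternate shelfmarks to the current one can be sketched as a simple alias table. The class name, its methods and the shelfmarks below are hypothetical placeholders, not the project's actual schema or data.

```python
from typing import Dict, Iterable, List


class ShelfmarkIndex:
    """Maps older or alternate shelfmarks to the current one (illustrative only)."""

    def __init__(self) -> None:
        self._current_of: Dict[str, str] = {}   # any known shelfmark -> current shelfmark
        self._data: Dict[str, List[str]] = {}   # current shelfmark -> attached data items

    def register(self, current: str, aliases: Iterable[str] = ()) -> None:
        """Record a current shelfmark together with its older or alternate forms."""
        self._current_of[current] = current
        self._data.setdefault(current, [])
        for alias in aliases:
            self._current_of[alias] = current

    def attach(self, shelfmark: str, item: str) -> str:
        """Attach data cited under any known shelfmark to the current one."""
        current = self._current_of[shelfmark]
        self._data[current].append(item)
        return current

    def data_for(self, shelfmark: str) -> List[str]:
        """Return everything attached to the fragment, whichever shelfmark is used."""
        return list(self._data[self._current_of[shelfmark]])


index = ShelfmarkIndex()
# Both shelfmarks are invented placeholders, not real Genizah shelfmarks.
index.register("COLL-A 12.3", aliases=["COLL-A old 77"])
index.attach("COLL-A old 77", "reference cited under the older shelfmark")
print(index.data_for("COLL-A 12.3"))
```

Data recorded under the older shelfmark is then found when the fragment is looked up under its current one.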

The Computerized Inventory List of all Genizah shelfmarks, sorted by collection, now resides on the “Genazim” servers in Jerusalem and is displayed on the Genizah website. Currently, the inventories contain about 247,000 shelfmarks and account, we believe, for the totality of Genizah shelfmarks.

Digital Images

In the early years of the 21st century, the only alternative available for a researcher desiring to study a specific Genizah fragment, other than traveling and examining it wherever it resided, was to use the corresponding microfilm, available principally at the Institute for Microfilmed Hebrew Manuscripts in Jerusalem, with all the inconveniences and shortcomings typical of this solution. The decision was therefore made, at the outset of Genazim activity, to produce full-color high-quality digital images of all Genizah fragments and to make them available through the Internet to any interested user. This would enable users to manipulate the images and study any Genizah fragment at any time and from anywhere.

This decision necessitated intense negotiations with representatives of every library that housed a Genizah collection, convincing them of the importance (and practicality) of digitizing their collection and of cooperating with the Friedberg Genizah Project in this task, persuading them to allow us to display a copy on our website, and negotiating and signing suitable legal agreements to protect their copyrights. We insisted on always digitizing both sides of every fragment, large, small or tiny, even when one (or both) of the sides seemed to be blank or unreadable. We also recorded missing fragments by taking the image of the corresponding envelope (or even of a simple page) with a “Missing” caption on it. Every image was allocated a unique “Genazim” number that (unlike the shelfmarks) is fixed and will never change. We encourage researchers to mention this number in their publications (in addition, of course, to the shelfmark).

Currently the Genizah website contains more than 400,000 digital images of Genizah fragments, representing probably more than 98% of the Cairo Genizah manuscripts. This digitization effort of the Genizah collection is probably one of the largest digitization efforts of historical manuscripts collections ever attempted, for any culture or language and by any institution.

The Data Axis

Bibliographical references

A complete set of references for all publications in any language that mention the shelfmark of a Cambridge Genizah fragment, from the discovery of the Genizah until 2008, compiled by the Cambridge Genizah Research Unit, was integrated into our databases courtesy of the Cambridge University Library. Moreover, all references to non-Cambridge shelfmarks in Hebrew publications until 2004, and an almost complete set of references to non-Cambridge fragments in publications in non-Hebrew languages, were compiled by the Friedberg Genizah Project bibliography teams and are also available on the website. In total, almost 200,000 such references are recorded in the databases.

Cataloging data

To every Genizah shelfmark we append (when available) a cataloging record that specifies, in (mostly) coded form, all available information on that shelfmark. Such data can be related either to the fragment’s physical aspects or to its “content” aspects: domain, title of work, author, language, script, scribe, date of copying, etc. About 270,000 such records are currently available on the website; a few of them rather “lean,” with just a couple of fields marked, others more complete.

Scans

To every shelfmark we append scans of all entries that appear in any Genizah-related catalog, whether published or printed, electronic or even just handwritten, that relate to this fragment. More than 70,000 such scanned entries are currently available on the website.

Transcriptions

Because of the sometimes difficult calligraphy and the physical state of many of the fragments, deciphering the text of a fragment is almost always a difficult task, done mostly by researchers. We therefore made an effort to attach to the image of a given fragment, whenever possible and available, its transcription. About 15,000 such transcriptions have been collected (or transcribed, when needed, to computer-readable form) and integrated in the Genizah databases, and are currently displayed on the website.

Translations

A large part of the Genizah fragments are in Judeo-Arabic, and many of these have been translated into Hebrew. A few fragments have also been translated (either from Judeo-Arabic or from Hebrew) into English. About 3,000 such translations are included on the website.

“Joins”

One of the most critical issues in Genizah research is that of discovering “joins,” i.e. different fragments — parts of folios or folios — originating from the same folio or from the same manuscript that have been dispersed in different libraries; one fragment being found, say, in Paris, and the other in Vienna. During a hundred years of research, about 4,000 or 5,000 of such “joins” were discovered through the erudition, memory and intelligence of Genizah scholars, assisted sometimes by available catalogs. Of these, about 3,000 are recorded in the system databases and clearly noted in the website.

The Software Axis

Querying the website

Searching for data on a specific shelfmark

Queries

A user can submit a query to the system and receive a list of all shelfmarks that satisfy a given set of conditions. The criteria can be a mix of all the data attached to the shelfmark. Following are some examples. One could search for:
- shelfmarks of all biblical fragments from Exodus that have cantillation signs, originate from the 12th to 14th centuries, contain at least 5 lines, and form a join with another given fragment;
- shelfmarks for which there is at least (or at most, or exactly) N (including zero) references, from a set of specified journals, or a set of specified authors, in some specified years, etc. (as an interesting example: the set of all shelfmarks from the Manchester collection that were never mentioned in any publication);
- shelfmarks from the “magic” domain, for which there are at least N different identifications, and for which there is an entry in catalog A.
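The way such criteria combine can be sketched as a conjunction of conditions over cataloging records. This is a minimal illustration, not the website's query engine: the `Record` fields, the `query` helper and the sample shelfmarks are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Record:
    """Hypothetical, simplified cataloging record."""
    shelfmark: str
    collection: str
    domain: str
    line_count: int
    references: List[str] = field(default_factory=list)  # citing publications
    joins: Set[str] = field(default_factory=set)          # shelfmarks joined to this one


Condition = Callable[[Record], bool]


def query(records: List[Record], *conditions: Condition) -> List[str]:
    """Return the shelfmarks of all records that satisfy every condition."""
    return [r.shelfmark for r in records if all(c(r) for c in conditions)]


# Invented sample data.
records = [
    Record("X 1", "Manchester", "Bible", 12, references=["Journal A 1990"]),
    Record("X 2", "Manchester", "magic", 4),
    Record("X 3", "Paris", "Bible", 7, joins={"X 1"}),
]

# Manchester shelfmarks never mentioned in any publication.
never_cited = query(
    records,
    lambda r: r.collection == "Manchester",
    lambda r: len(r.references) == 0,
)
print(never_cited)  # ['X 2']

# Biblical fragments with at least 5 lines that join with a given fragment.
bible_joins = query(
    records,
    lambda r: r.domain == "Bible",
    lambda r: r.line_count >= 5,
    lambda r: "X 1" in r.joins,
)
print(bible_joins)  # ['X 3']
```

Expressing each criterion as a separate predicate keeps "at least", "at most" and "exactly" conditions interchangeable.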

Full-text Queries

A full-text search is also available, and can be applied to the transcription/translation texts, the Genizah catalogs’ text, the “running title” or the free text section of the cataloging record, etc.

Additional tools

Workspace

An individual workspace is available to every user, where they can store and manipulate, in a structure of their own design, a small set of images that they are currently researching, and which is kept for them from session to session.

Forum

A public forum where users can exchange information, discuss issues about given shelfmarks, add or correct data, etc., is available to all users. Any user can also build a “restricted” forum for internal discussions between himself and his restricted set of colleagues.

Notes

Short notes can be written by any user, to be appended to a specific shelfmark and displayed to all users.

Input

A special module (“FOLUS” — Friedberg Online Users’ Input) allows accredited users to add information (identifications, cataloging data, joins, transcriptions, etc.) to the system, which will be integrated in the databases and displayed on the website (with their names as the source) the very next day.

Research Achievements in Digital Image Analysis

Physical attributes

Remarkable results were achieved by the advanced computerized analysis of a manuscript’s digital image. The starting point was to try to discover what physical attributes of a Genizah fragment could be automatically deduced by the computer through a fine analysis of its digital image. We developed a few software modules that, through a computerized analysis, allow the system to:

- recognize and follow the exact contour of the textual part of the image, thus separating it from its background;
- measure the fragment’s inner and outer dimensions;
- count its number of lines;
- compute the average written-line width and length, the average inter-line width, and the average “text density” (the number of letters in a specified unit of measure);
- detect the existence of margins and compute their average dimensions; and more.

It was thus proved that this type of data, considered essential in the study of manuscripts and partially found in catalogs of manuscript collections, which until now had been marked manually by scholars with a notable waste of precious research time, can now be extracted automatically from the fragment’s digital image with much more accuracy and efficiency. We ran these modules on the complete set of Genizah images, and the data derived by this process is integrated into our databases and displayed on the Genizah website.
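The text does not describe how these modules work internally. As a rough sketch of the kind of measurement involved, lines can be counted on a binarized image with a horizontal projection profile, a common textbook technique that is assumed here purely for illustration. The function names and the toy image are hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # binarized page: 1 = ink, 0 = background


def count_text_lines(image: Image, min_ink: int = 1) -> int:
    """Count runs of consecutive rows containing at least `min_ink` ink pixels."""
    lines = 0
    in_line = False
    for row in image:
        has_ink = sum(row) >= min_ink
        if has_ink and not in_line:
            lines += 1  # a new run of inked rows starts here
        in_line = has_ink
    return lines


def text_bounding_box(image: Image) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) of the inked area, or None if blank."""
    if not image:
        return None
    rows = [y for y, row in enumerate(image) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(image[0])) if any(row[x] for row in image)]
    return rows[0], cols[0], rows[-1], cols[-1]


# Toy 6x6 "page" with two written lines.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(count_text_lines(page))   # 2
print(text_bounding_box(page))  # (1, 1, 4, 4)
```

Real fragments are torn, skewed and stained, so a production system would need far more robust segmentation than this toy profile.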

Suggesting joins

A crucial further step was achieved when we succeeded in developing a complex program capable of analyzing the handwriting in the images of two different fragments and asserting the probability that both were written by the same scribe (and so, perhaps, originate from the same manuscript). This is not done through the analysis of the individual handwritten letters and their shapes, but rather through a global comparison scheme, vaguely similar to the way in which two portraits can be compared by computer and found to be of the same person.

The basic idea is to match each of the Genizah fragments with one another so as to obtain a similarity score for each pair of fragments. Out of an estimated total number of 320,000 fragments, about 230,000 fragments were available to us, represented by 450,000 digital images, with two images per fragment (recto and verso). For every fragment, a numerical signature vector was computed, encapsulating aspects of its writing style. With a specially designed software component that measures the readability of every fragment, we eliminated from this scenario most fragments with poor legibility, those that most likely would not contribute true joins but rather would degrade the effectiveness of the system. These included blank or almost-blank pages, illegible or very dark texts, minute fragments, etc. After eliminating these problematic items, we were left with a total of 158,000 fragments to be compared with one another. That gave a total of 12.4 billion pairs that needed to be measured for similarity, a huge number indeed. Some twenty different similarity scores were computed and stored for each pair. These were generated by using four different algorithms to represent the handwriting style of each document and by using different similarity measures between documents. The different similarity scores can be "stacked" together to achieve higher accuracy.

During June 2013, twenty CPUs from the Computing Lab of the Blavatnik School of Computer Science at Tel Aviv University ran together continuously for 37 days (the equivalent of some 18,000 computing hours), and the task was accomplished. This computer run is probably one of the most intensive ever implemented in a digital humanities context, in terms of computing resources. Four terabytes of output were generated in the process.
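The orders of magnitude quoted here can be checked with a few lines of arithmetic. The sketch assumes that every fragment is compared once with every other fragment and uses the rounded figure of 158,000 fragments, so its result (about 12.5 billion) only approximates the 12.4 billion pairs quoted; the difference presumably reflects rounding of the fragment count.

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


fragments = 158_000              # rounded figure quoted in the text
pairs = unordered_pairs(fragments)
print(f"{pairs:,}")              # 12,481,921,000

cpu_hours = 20 * 37 * 24         # 20 CPUs x 37 days x 24 hours per day
print(f"{cpu_hours:,}")          # 17,760
```

The second figure agrees with the "some 18,000 computing hours" mentioned above.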

An efficient and compressed database was built to preserve these results in a structure that is easy to manipulate within a reasonable on-line response time. For each fragment, the top 300 similar fragments were precomputed.
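Precomputing the most similar fragments can be illustrated with a small top-k selection. This is a sketch under simplifying assumptions: scores are held in an in-memory dictionary keyed by fragment pair, which would not scale to billions of pairs, and the function name and fragment identifiers are invented.

```python
import heapq
from typing import Dict, List, Tuple


def top_k_similar(scores: Dict[Tuple[str, str], float], k: int) -> Dict[str, List[str]]:
    """For each fragment, keep the k partners with the highest similarity score."""
    neighbours: Dict[str, List[Tuple[float, str]]] = {}
    for (a, b), s in scores.items():
        # Similarity is symmetric, so each pair contributes to both fragments.
        neighbours.setdefault(a, []).append((s, b))
        neighbours.setdefault(b, []).append((s, a))
    return {
        frag: [other for _, other in heapq.nlargest(k, cands)]
        for frag, cands in neighbours.items()
    }


# Invented scores for four hypothetical fragments.
scores = {
    ("f1", "f2"): 0.91,
    ("f1", "f3"): 0.15,
    ("f2", "f3"): 0.40,
    ("f1", "f4"): 0.77,
}
table = top_k_similar(scores, k=2)
print(table["f1"])  # ['f2', 'f4']
```

With k set to 300, a lookup of this kind answers a join-suggestion request without touching the full set of pairwise scores.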

In the Join Suggestions page there are three buttons that allow the user to use any of three different algorithms (described below) and to receive relevant join suggestions.

OSS-S4 (the basic algorithm)

The algorithm ranks fragments based on stacking 4 "learned" One-Shot Similarity measures. Up to 100 results that were not presented in the clustering algorithm will be displayed.
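The idea of stacking several similarity measures into one ranking can be sketched as follows. This is not the OSS-S4 implementation: the logistic combination, the weights and the bias are assumptions standing in for whatever was actually learned, and the candidate names and scores are invented.

```python
import math
from typing import Dict, List, Sequence, Set


def stacked_score(scores: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Combine several base similarity scores into one value in (0, 1)."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(
    base_scores: Dict[str, List[float]],
    weights: Sequence[float],
    bias: float,
    already_shown: Set[str],
    limit: int = 100,
) -> List[str]:
    """Rank candidates by stacked score, skipping those already presented."""
    ranked = sorted(
        (c for c in base_scores if c not in already_shown),
        key=lambda c: stacked_score(base_scores[c], weights, bias),
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical weights; in real stacking they would be fitted on known joins.
weights, bias = [1.5, 1.0, 0.5, 2.0], -2.0
base = {
    "cand_a": [0.9, 0.8, 0.7, 0.9],
    "cand_b": [0.2, 0.1, 0.3, 0.2],
    "cand_c": [0.6, 0.7, 0.5, 0.4],
}
ranking = rank_candidates(base, weights, bias, already_shown={"cand_c"})
print(ranking)  # ['cand_a', 'cand_b']
```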

Graph BCC (Bi-Connected Components algorithm)

This algorithm builds a graph based on Bi-Connected Components and isolates the bi-connected component of the given fragment. The results obtained are fragments that are connected to the requested fragment by at least two independent pathways in the graph. The advantage of this algorithm is that it eliminates many irrelevant suggestions. It is possible, therefore, that for a given fragment there will be very few suggestions or no suggestions at all.

This algorithm was developed by Prof. Zev Wilkovitz and Dr. Zechariah Frankel from the Software Engineering department in the Ort-Braude College in Carmiel.
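The bi-connected component idea can be illustrated with the classic Hopcroft-Tarjan depth-first search. This is a generic textbook sketch, not the developers' code: how the similarity graph is built (here, an edge between two fragments is simply assumed to mean a high similarity score) and the function names are assumptions.

```python
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets


def biconnected_components(graph: Graph) -> List[Set[str]]:
    """Return the vertex sets of all biconnected components (Hopcroft-Tarjan)."""
    depth: Dict[str, int] = {}
    low: Dict[str, int] = {}
    edge_stack: List[Tuple[str, str]] = []
    components: List[Set[str]] = []

    def dfs(u: str, parent: Optional[str], d: int) -> None:
        depth[u] = low[u] = d
        for v in graph[u]:
            if v == parent:
                continue
            if v not in depth:                   # tree edge
                edge_stack.append((u, v))
                dfs(v, u, d + 1)
                low[u] = min(low[u], low[v])
                if low[v] >= depth[u]:           # u separates v's subtree
                    comp: Set[str] = set()
                    while True:
                        a, b = edge_stack.pop()
                        comp.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(comp)
            elif depth[v] < depth[u]:            # back edge to an ancestor
                edge_stack.append((u, v))
                low[u] = min(low[u], depth[v])

    for start in graph:
        if start not in depth:
            dfs(start, None, 0)
    return components


def join_candidates(graph: Graph, fragment: str) -> Set[str]:
    """Fragments sharing a component of three or more vertices with `fragment`.

    Two-vertex components are single edges (bridges) and offer only one
    pathway, so they are excluded.
    """
    result: Set[str] = set()
    for comp in biconnected_components(graph):
        if fragment in comp and len(comp) >= 3:
            result |= comp
    result.discard(fragment)
    return result


# Two triangles (A-B-C and D-E-F) linked by the single edge C-D.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("F", "D")]
graph: Graph = {}
for x, y in edges:
    graph.setdefault(x, set()).add(y)
    graph.setdefault(y, set()).add(x)

print(sorted(join_candidates(graph, "A")))  # ['B', 'C']
```

In the example, D is not suggested for A because the only route between them passes through C. The recursive search is fine for small graphs; a graph with very long paths would need an iterative version.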

SCSS-1000 (statistical graph model)

The algorithm is based on advanced graphical-model methods and constructs a graph over the 1,000 fragments that are most similar to the given fragment. It then isolates and displays the cluster containing the given fragment. The special features of the algorithm help it find and offer new suggestions, including fragments that seem far from the given fragment according to the proximity measure of the basic algorithm.

The Joins tool was developed as a joint project with Prof. Lior Wolf and Prof. Nachum Dershowitz of the Blavatnik School of Computer Science at Tel Aviv University.

Jigsaw

When trying to test a hypothesis about the possibility of joining 4 or 5 fragments, say, into one folio, a researcher may invoke the function “Jigsaw,” giving it the numbers of these fragments’ images. The images will then be displayed on his (preferably large) screen, where he can rotate or move any of them in an effort to fit them physically together, as in a real puzzle. If satisfied, he can then store the final image on the website.

To make the system even more user-friendly and intuitive to researchers, even if they are not completely at ease with computers, a large 42" touch-screen was installed in the Genazim lab, as a prototype, with an attached PC on which the website and the software were installed, all completely transparent to the user. Using a virtual keyboard, the user approaches the system by inputting an image number, receives back the images of the 100 best potential candidates, marks the relevant ones and passes them over to Jigsaw. There they can be easily manipulated (moved, rotated, flipped, calibrated) by just touching the screen with one's fingers, much as one does nowadays with smartphones, and as naturally as one might arrange a jigsaw puzzle spread out on a table.

Word-Spotting

The Word-Spotting tool makes it possible to locate Genizah fragments that contain a given word, not only in fragments that have been manually transcribed (as with the full-text search) but in any Genizah fragment, by searching for the image of the word rather than for its textual representation.

The user highlights a search word in the image of a given fragment, and the program will then scan the digital images of the entire Genizah collection, to locate and retrieve fragments that contain an image similar to the one highlighted.
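The query-by-image principle can be illustrated with a deliberately naive sliding-window comparison on binarized images. This is only a stand-in to show the idea of searching for the picture of a word rather than its text. It is not the method used by the tool, and the function names and toy images are invented.

```python
from typing import List, Tuple

Image = List[List[int]]  # binarized: 1 = ink, 0 = background


def match_score(page: Image, patch: Image, top: int, left: int) -> float:
    """Fraction of pixels in the window at (top, left) that agree with the patch."""
    h, w = len(patch), len(patch[0])
    agree = sum(
        page[top + y][left + x] == patch[y][x]
        for y in range(h)
        for x in range(w)
    )
    return agree / (h * w)


def spot(page: Image, patch: Image, threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Return the (top, left) positions where the patch matches well enough."""
    h, w = len(patch), len(patch[0])
    hits = []
    for top in range(len(page) - h + 1):
        for left in range(len(page[0]) - w + 1):
            if match_score(page, patch, top, left) >= threshold:
                hits.append((top, left))
    return hits


# The "word" highlighted by the user, as a tiny 2x3 patch.
word = [
    [1, 0, 1],
    [1, 1, 1],
]
# A toy page containing the same shape twice.
page_img = [
    [1, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 1],
]
print(spot(page_img, word))  # [(0, 0), (1, 4)]
```

Handwriting varies in size, slant and stroke width, so an exact pixel comparison like this one would miss most real occurrences; it only shows why the search is expensive, since every window of every image must be examined.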

The query can emanate from any Genizah fragment, but because of the very large computer resources needed for this operation, results will be restricted temporarily to the University of Cambridge Library Taylor-Schechter Judeo-Arabic collection (CUL T-S Ar).

For the same reason the search is not processed online; the user will receive an appropriate message when the results are ready (usually just a few minutes after submitting the query).

The WS tool was developed as a joint project with Prof. Lior Wolf and Prof. Nachum Dershowitz of the Blavatnik School of Computer Science at Tel Aviv University.

Mobile Application

The Genazim mobile app enables any user to view any of the 460,000 images of the Genizah manuscripts available on the Friedberg Genizah website on mobile devices such as smartphones, iPads and other tablets. Access to images is very fast and requires only typing the image’s FGP number. The application includes basic image-processing functions such as zooming in and out.

The app can be downloaded free of charge from Google Play or from the Apple App Store under the name Cairo Genizah.