- Inventories
The first task undertaken by “Genazim” was the compilation of a comprehensive, precise
and up-to-date computerized inventory of the formal shelfmarks of all the Genizah
fragments in all of the Genizah collections around the world, whether large or small,
“important” or not, public or private.
To achieve this, and to assure the comprehensiveness of the process, we had, among
other things, to compile, for the first time, an exhaustive list of all the public
and private Genizah collections in the world, tracking down even “collections” that
contain a single fragment.
As later became evident, this effort not only allowed us to attach, from that moment
on, all available and future data on any fragment to its shelfmark, but also prompted
Genizah researchers to make use of the precise and accurate shelfmarks (as formally
defined by the relevant libraries) appearing in our inventories in their publications,
thus encouraging a much needed trend of standardization in that context. Moreover,
since fragments may often change shelfmarks, for the reasons detailed above, we
made a sustained effort to collect all of the older or alternate shelfmarks of a
given collection, to record them and to attach them to the current one, so as to
correctly attach data that may have been appended to an older shelfmark to the newer
one.
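The alias handling described above can be pictured as a lookup table from every older or alternate shelfmark to the current one. The following is only a minimal sketch under stated assumptions: the class name `ShelfmarkIndex`, its methods, the normalization rule, and the shelfmarks used with it are all hypothetical and do not describe the actual Genazim database.

```python
from collections import defaultdict


def normalize(shelfmark: str) -> str:
    """Collapse whitespace and letter case so spelling variants compare equal."""
    return " ".join(shelfmark.split()).casefold()


class ShelfmarkIndex:
    """Hypothetical index: resolves older/alternate shelfmarks to the current one."""

    def __init__(self):
        self._current = {}               # normalized name -> current shelfmark
        self._data = defaultdict(list)   # current shelfmark -> attached data items

    def register(self, current, aliases=()):
        """Record a current shelfmark together with its older or alternate forms."""
        for name in (current, *aliases):
            self._current[normalize(name)] = current

    def resolve(self, shelfmark):
        """Return the current shelfmark, or None if the name is unknown."""
        return self._current.get(normalize(shelfmark))

    def attach(self, shelfmark, item):
        """Attach data to the current shelfmark, whichever form was cited."""
        current = self.resolve(shelfmark)
        if current is None:
            raise KeyError(f"unknown shelfmark: {shelfmark!r}")
        self._data[current].append(item)

    def data_for(self, shelfmark):
        """Return all data attached under any form of this shelfmark."""
        return list(self._data.get(self.resolve(shelfmark), []))
```

With such a table, a reference that cites an obsolete shelfmark still ends up attached to the fragment's current record.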
The Computerized Inventory List of all Genizah shelfmarks, sorted by collections,
resides now in the “Genazim” servers in Jerusalem and is displayed on the Genizah
website. Currently, the inventories contain about 247,000 shelfmarks and account,
we believe, for the totality of Genizah shelfmarks.
- Digital Images
In the early years of the 21st century, the only alternative available for a researcher
desiring to study a specific Genizah fragment, other than traveling and examining
it wherever it resided, was to use the corresponding microfilm, available principally
at the Institute for Microfilmed Hebrew Manuscripts in Jerusalem, with all the inconveniences
and shortcomings typical of this solution. The decision was therefore made, at the
outset of Genazim activity, to produce full-color high-quality digital images of
all Genizah fragments and to make them available through the Internet to any interested
user. This would enable users to manipulate the images and study any Genizah fragment
at any time and from anywhere.
This decision necessitated intense negotiations with representatives of every library
that housed a Genizah collection, convincing them of the importance (and practicality)
of digitizing their collection and of cooperating with the Friedberg Genizah Project
in this task, persuading them to allow us to display a copy on our website, and
negotiating and signing suitable legal agreements to protect their copyrights. We
insisted on always digitizing both sides of every fragment, large, small or tiny,
even when one (or both) of the sides seemed to be blank or unreadable. We also
recorded missing fragments by taking the image of the corresponding envelope (or
even of a simple page) with a “Missing” caption on it. Every image was allocated
a unique “Genazim” number that (unlike the shelfmarks) is fixed and will never change.
We encourage researchers to mention this number in their publications (in addition,
of course, to the shelfmark).
Currently the Genizah website contains more than 400,000 digital images of Genizah
fragments, representing probably more than 98% of the Cairo Genizah manuscripts.
This digitization effort of the Genizah collection is probably one of the largest
digitization efforts of historical manuscripts collections ever attempted, for any
culture or language and by any institution.
- The Data Axis
Bibliographical references
A complete set of references for all publications in any language that mention the
shelfmark of a Cambridge Genizah fragment, from the discovery of the Genizah until
2008, compiled by the Cambridge Genizah Research Unit, was integrated into our databases
courtesy of the Cambridge University Library. Moreover, all references to non-Cambridge
shelfmarks in Hebrew publications until 2004, and an almost complete set of references
to non-Cambridge fragments in publications in non-Hebrew languages, were compiled
by the Friedberg Genizah Project bibliography teams and are also available on the
website. In total, almost 200,000 such references are recorded in the databases.
Cataloging data
To every Genizah shelfmark we append (when available) a cataloging record that specifies,
in (mostly) coded form, all available information on that shelfmark. Such data can
be related either to the fragment’s physical aspects or to its “content” aspects:
domain, title of work, author, language, script, scribe, date of copying, etc. About
270,000 such records are currently available on the website; a few of them rather
“lean,” with just a couple of fields marked, others more complete.
Scans
To every shelfmark we append scans of all entries that appear in any Genizah-related
catalog, whether published or printed, electronic or even just handwritten, that
relate to this fragment. More than 70,000 such scanned entries are currently available
on the website.
Transcriptions
Because of the sometimes difficult calligraphy and the physical state of many of
the fragments, deciphering the text of a fragment is almost always a difficult task,
done mostly by researchers. We therefore made an effort to attach to the image of
a given fragment, whenever possible and available, its transcription. About 15,000
such transcriptions have been collected (or transcribed, when needed, to computer-readable
form) and integrated in the Genizah databases, and are currently displayed on the
website.
Translations
A large part of the Genizah fragments are in Judeo-Arabic, and many of these have
been translated into Hebrew. A few fragments have also been translated (either from
Judeo-Arabic or from Hebrew) into English. About 3,000 such translations are available
on the website.
“Joins”
One of the most critical issues in Genizah research is that of discovering “joins,”
i.e. different fragments — parts of folios or folios — originating from the same
folio or from the same manuscript that have been dispersed in different libraries;
one fragment being found, say, in Paris, and the other in Vienna. During a hundred
years of research, about 4,000 or 5,000 of such “joins” were discovered through
the erudition, memory and intelligence of Genizah scholars, assisted sometimes by
available catalogs. Of these, about 3,000 are recorded in the system databases and
clearly noted in the website.
- The Software Axis
Querying the website
- Searching for data on a specific shelfmark
- Queries
A user can submit a query to the system and receive a list of all shelfmarks that
satisfy a given set of conditions. The criteria can be a mix of all the data attached
to the shelfmark. Following are some examples. One could search for:
- shelfmarks of all biblical fragments from Exodus that have cantillation signs, originate
from the 12th to 14th centuries, contain at least 5 lines, and form a join with
another given fragment;
- shelfmarks for which there is at least (or at most, or exactly) N (including zero)
references, from a set of specified journals, or a set of specified authors, in
some specified years, etc. (as an interesting example: the set of all shelfmarks
from the Manchester collection that were never mentioned in any publication);
- shelfmarks from the “magic” domain, for which there are at least N different identifications,
and for which there is an entry in catalog A.
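The query examples above all amount to combining independent conditions over the data attached to each shelfmark. Here is a minimal sketch of that idea, assuming a simplified in-memory record; the `Record` fields, the helper names, and the sample shelfmarks used with them are hypothetical and are not the website's actual query interface.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical, simplified view of the data attached to one shelfmark."""
    shelfmark: str
    collection: str
    domain: str = ""
    line_count: int = 0
    references: list = field(default_factory=list)  # publication identifiers
    joins: set = field(default_factory=set)          # shelfmarks joined to this one


def query(records, *conditions):
    """Return the shelfmarks of all records that satisfy every condition."""
    return [r.shelfmark for r in records if all(c(r) for c in conditions)]


def in_collection(name):
    return lambda r: r.collection == name


def reference_count(minimum=0, maximum=None):
    """At least `minimum` and, if given, at most `maximum` references."""
    def check(r):
        n = len(r.references)
        return n >= minimum and (maximum is None or n <= maximum)
    return check


def min_lines(n):
    return lambda r: r.line_count >= n


def joined_with(shelfmark):
    return lambda r: shelfmark in r.joins
```

For instance, `query(records, in_collection("Manchester"), reference_count(maximum=0))` would list the Manchester shelfmarks never mentioned in any publication, given a suitable list of records.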
- Full-text Queries
A full-text search is also available, and can be applied to the transcription/translation
texts, the Genizah catalogs’ text, the “running title” or the free text section
of the cataloging record, etc.
Additional tools
Workspace
An individual workspace is available to every user, where they can store and manipulate
in their privately designed structure a small set of images which they are currently
researching, and which is stored for them from session to session.
Forum
A public forum where users can exchange information, discuss issues about given shelfmarks,
add or correct data, etc., is available to all users. Any user can also build a
“restricted” forum for internal discussions between himself and his restricted set
of colleagues.
Notes
Short notes can be written by any user, to be appended to a specific shelfmark and
displayed to all users.
Input
A special module (“FOLUS” — Friedberg Online Users’ Input) allows accredited users
to add information (identifications, cataloging data, joins, transcriptions, etc.)
to the system, which will be integrated in the databases and displayed on the website
(with their names as the source) the very next day.
- Research Achievements in Digital Image Analysis
Physical attributes
Remarkable results were achieved by the advanced computerized analysis of a manuscript’s
digital image. The starting point was to try and discover what physical attributes
of a Genizah fragment could be automatically deduced by the computer through a fine
analysis of its digital image. We developed a few software modules that, through a
computerized analysis, can allow the system to:
- recognize and follow the exact contour of the textual part of the image, thus separating
it from its background;
- measure the fragment’s inner and outer dimensions;
- count its number of lines;
- compute the average written-line width and length, the average inter-line width,
the average “text density” (the number of letters in a specified measure unit);
- compute the existence of margins and their average dimensions; and more.
It was thus proved that this type of data, considered essential in the study of
manuscripts and partially found in catalogs of manuscript collections, which until
now had been marked manually by scholars with a notable waste of precious research
time, can now be extracted automatically from the fragment’s digital image with
much more accuracy and efficiency. We ran these modules on the complete set
of Genizah images, and the data derived by this process is integrated into our databases
and displayed on the Genizah website.
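The text does not say how the line-counting and spacing modules work internally. One common way to obtain such measurements, shown here purely as an illustrative sketch and not as the project's method, is a horizontal projection profile: count the ink pixels in each row of a binarized image and treat each run of inked rows as one written line. The function names and the binary-image representation (lists of 0/1 values) are assumptions.

```python
def row_profile(image):
    """Count ink pixels in each row of a binary image (1 = ink, 0 = background)."""
    return [sum(row) for row in image]


def text_line_bands(image, min_ink=1):
    """Return inclusive (top, bottom) row ranges of consecutive rows containing ink."""
    bands, start = [], None
    for y, ink in enumerate(row_profile(image)):
        if ink >= min_ink and start is None:
            start = y                      # a written line begins
        elif ink < min_ink and start is not None:
            bands.append((start, y - 1))   # the written line ended on the previous row
            start = None
    if start is not None:                  # the last line runs to the bottom edge
        bands.append((start, len(image) - 1))
    return bands


def line_statistics(image, min_ink=1):
    """Line count, mean line height and mean inter-line gap, measured in pixel rows."""
    bands = text_line_bands(image, min_ink)
    heights = [bottom - top + 1 for top, bottom in bands]
    gaps = [bands[i + 1][0] - bands[i][1] - 1 for i in range(len(bands) - 1)]
    return {
        "line_count": len(bands),
        "mean_line_height": sum(heights) / len(heights) if heights else 0.0,
        "mean_interline_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }
```

Real fragment images are skewed, torn and stained, so a working module would need deskewing, binarization and noise handling well beyond this sketch.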
Suggesting joins
A crucial further step was achieved when we succeeded in developing a complex program
capable of analyzing the handwriting in the images of two different fragments and
asserting the probability that both were written by the same scribe (and so, perhaps,
originate from the same manuscript). This is not done through the analysis of the
individual handwritten letters and their shapes, but rather through a global comparison
scheme, vaguely similar to the way in which two portraits can be compared by computer
and found to be of the same person.
The basic idea is to match each of the Genizah fragments with one another so as
to obtain a similarity score for each pair of fragments. Out of an estimated total
number of 320,000 fragments, about 230,000 fragments were available to us, represented
by 450,000 digital images, with two images per fragment (recto and verso). For every
fragment, a numerical signature vector was computed, encapsulating aspects of its
writing style. With a specially designed software component that measures the readability
of every fragment, we eliminated from this scenario most fragments with poor legibility,
those that most likely would not contribute true joins but rather would deteriorate
the effectiveness of the system. These included blank or almost-blank pages, illegible
or very dark texts, minute fragments, etc. After eliminating these problematic items,
we were left with a total of 158,000 fragments to be compared with one another.
That gave a total of 12.4 billion pairs that needed to be measured for similarity,
a huge number indeed. Some twenty different similarity scores were computed and
stored for each pair. These were generated by using four different algorithms to
represent the handwriting style of each document and by using different similarity
measures between documents. The different similarity scores can be "stacked" together
to achieve higher accuracy.
During June 2013 twenty CPUs from the Computing Lab of the Blavatnik School of
Computer Science at Tel Aviv University ran together continuously for 37 days (the
equivalent of some 18,000 computing hours), and the task was accomplished. This
computer run is probably one of the most intensive ever implemented in a digital
humanities context, in terms of computing resources. Four terabytes of output were
generated in the process.
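The two headline figures above follow from simple arithmetic, which can be checked with a few lines of code. The function names below are illustrative only; the inputs are the numbers given in the text.

```python
from math import comb


def pair_count(fragments: int) -> int:
    """Number of unordered pairs among n fragments: n * (n - 1) / 2."""
    return comb(fragments, 2)


def cpu_hours(cpus: int, days: int) -> int:
    """Total computing hours when `cpus` processors run continuously for `days` days."""
    return cpus * days * 24


pairs = pair_count(158_000)   # 12,481,921,000, in line with the quoted 12.4 billion
hours = cpu_hours(20, 37)     # 17,760, i.e. "some 18,000 computing hours"
print(f"{pairs:,} pairs, {hours:,} CPU hours")
```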
An efficient and compressed database was built to preserve these results in a structure
that is easy to manipulate within a reasonable on-line response time. For each fragment,
the top 300 similar fragments were precomputed.
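Precomputing a fixed number of best matches per fragment is a standard top-k selection. A minimal sketch follows, assuming the pair scores arrive as a stream of `(fragment_a, fragment_b, score)` tuples; that input format and the function name are assumptions, and the actual database layout is not described in the text.

```python
import heapq
from collections import defaultdict


def top_k_neighbours(pair_scores, k=300):
    """Keep, for every fragment, only its k highest-scoring partners.

    pair_scores: iterable of (fragment_a, fragment_b, score) tuples.
    Returns {fragment: [(other_fragment, score), ...]} sorted best first.
    """
    heaps = defaultdict(list)   # fragment -> min-heap of (score, other fragment)
    for a, b, score in pair_scores:
        for src, dst in ((a, b), (b, a)):       # similarity is symmetric
            heap = heaps[src]
            if len(heap) < k:
                heapq.heappush(heap, (score, dst))
            elif (score, dst) > heap[0]:        # better than the weakest kept entry
                heapq.heapreplace(heap, (score, dst))
    return {
        fragment: [(other, s) for s, other in sorted(heap, reverse=True)]
        for fragment, heap in heaps.items()
    }
```

Because each heap never holds more than k entries, memory stays proportional to the number of fragments rather than to the number of pairs, which matters when the pairs run into the billions.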
In the Join Suggestions page there are three buttons that allow the user to use
any of three different algorithms (described below) and to receive relevant join
suggestions.
OSS-S4 (the basic algorithm)
The algorithm ranks fragments based on stacking 4 "learned" One-Shot Similarity
measures. Up to 100 results that were not presented in the clustering algorithm
will be displayed.
Graph BCC - Bi-Connected Components algorithm
This algorithm builds a graph based on Bi-Connected Components and isolates the
bi-connected component of the given fragment. The results obtained are fragments
that are connected to the requested fragment by at least two independent pathways
in the graph. The advantage of this algorithm is that it eliminates many irrelevant
suggestions. It is possible, therefore, that for a given fragment there will be
very few suggestions or no suggestions at all.
This algorithm was developed by Prof. Zev Wilkovitz and Dr. Zechariah Frankel from
the Software Engineering department in the Ort-Braude College in Carmiel.
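To illustrate the bi-connected component idea in general terms, the sketch below uses the classic depth-first (Hopcroft-Tarjan) method on an undirected graph. It is not the developers' implementation. How the similarity graph is built (for example, which score threshold creates an edge) is not stated in the text, so the edge list is simply taken as given, and the function names are hypothetical.

```python
from collections import defaultdict


def biconnected_components(edges):
    """Return a list of vertex sets, one per bi-connected component."""
    graph = defaultdict(set)
    for u, v in edges:
        if u != v:                       # ignore self-loops
            graph[u].add(v)
            graph[v].add(u)

    depth, low = {}, {}
    edge_stack, components = [], []

    def visit(u, parent, d):
        # Recursive DFS: adequate for a sketch, but very deep graphs
        # would need an iterative version.
        depth[u] = low[u] = d
        for v in graph[u]:
            if v == parent:
                continue
            if v not in depth:                    # tree edge
                edge_stack.append((u, v))
                visit(v, u, d + 1)
                low[u] = min(low[u], low[v])
                if low[v] >= depth[u]:            # u cuts off v's subtree
                    component = set()
                    while True:
                        a, b = edge_stack.pop()
                        component.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(component)
            elif depth[v] < depth[u]:             # back edge to an ancestor
                edge_stack.append((u, v))
                low[u] = min(low[u], depth[v])

    for node in list(graph):
        if node not in depth:
            visit(node, None, 0)
    return components


def join_candidates(edges, fragment):
    """Fragments sharing a component of three or more vertices with `fragment`."""
    found = set()
    for component in biconnected_components(edges):
        if fragment in component and len(component) >= 3:
            found |= component
    found.discard(fragment)
    return found
```

A component with only two vertices is a single edge (a bridge), so it offers just one pathway and is skipped. This is why a fragment whose links are all bridges receives no suggestions at all.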
SCSS-1000 - statistical graph model
The algorithm is based on advanced methods of graphical models, allowing the construction
of a graph for the 1000 fragments that are the most similar to the given fragment.
The algorithm isolates and displays the cluster containing the given fragment. Its
special features help find and offer new suggestions, including fragments that
seem far from the given fragment according to the proximity measure of the basic
algorithm.
The Joins tool was developed as a joint project with Prof. Lior Wolf and Prof. Nachum
Dershowitz of the Blavatnik School of Computer Science at Tel Aviv University.
Jigsaw
When trying to test a hypothesis about the possibility of joining 4 or 5 fragments,
say, into one folio, a researcher may invoke the function “Jigsaw,” giving it the
numbers of these fragments’ images. The images will then be displayed on his (preferably
large) screen, where he can rotate or move any of them in an effort to fit them
physically together, as in a real puzzle. If satisfied, he can then store the final
image on the website.
To make the system even more user-friendly and intuitive to researchers, even if
they are not completely at ease with computers, a large 42" touch-screen was installed
in the Genazim lab, as a prototype, with an attached PC on which the website and
the software were installed, all completely transparent to the user. Using a virtual
keyboard, the user approaches the system by inputting an image number, receives
back the images of the 100 best potential candidates, marks some of the relevant
ones, passing them over to Jigsaw, where they can be easily manipulated – moved,
rotated, flipped, calibrated – by just touching the screen with one's fingers, much
like what one is used to doing nowadays with smartphones, and as naturally as one
might arrange a jigsaw puzzle spread out on a table.
Word-Spotting
The Word-Spotting tool makes it possible to locate Genizah fragments that contain
a given word, not only in fragments that have been manually transcribed (as with
the full-text search) but in any Genizah fragment, by searching for the image of
the word rather than for its textual representation.
The user highlights a search word in the image of a given fragment, and the program
will then scan the digital images of the entire Genizah collection, to locate and
retrieve fragments that contain an image similar to the one highlighted.
The query can emanate from any Genizah fragment, but because of the very large computer
resources needed for this operation, results will be restricted temporarily to the
University of Cambridge Library Taylor-Schechter Judeo-Arabic collection (CUL T-S
Ar).
For the same reason the search is not processed online; the user will receive an
appropriate message when the results are ready (usually just a few minutes after
submitting the query).
The WS tool was developed as a joint project with Prof. Lior Wolf and Prof. Nachum
Dershowitz of the Blavatnik School of Computer Science at Tel Aviv University.
Mobile Application
The Genazim mobile app enables any user to view any of the 460,000 images of the Genizah
manuscripts available on the Friedberg Genizah website on mobile devices: smartphones,
iPads, tablets, etc. Access to images is very fast and requires only typing
the image's FGP number. The application includes basic image processing functions
such as zooming in and zooming out.
The app can be downloaded free of charge from Google Play or from
the Apple App Store under the name: Cairo Genizah.