Japanese Woodblock Print Database
January 6, 2013 12:13 PM

A database and visual search engine of over 200,000 Japanese woodblock prints. Starting in the early 1700s and exploding in popularity throughout the 1800s, Japanese woodblock prints depicted the fantastic world of Kabuki actors, courtesans, warriors, and nature. The style of the prints feels particularly modern and vibrant, even today a couple hundred years later. This project aggregates prints from a number of museums, dealers, and auction houses into a single searchable resource.

This project was started as a tool to aid myself in my research into Japanese woodblock prints. I found it to be incredibly difficult to perform even simple actions like locating a print or finding more information about a print.

This project has a few key components: 1) Prints are aggregated from a number of different sources around the globe (currently 23), all of which have various levels of usability (or un-usability, depending upon how you look at it). 2) Wherever possible artist names have been simplified and translated into English (making a number of Japanese-only collections searchable to lay scholars). 3) An image search engine has been implemented, making it possible for people to take pictures of prints that they see and find out more information about them.

As it stands this is a unique resource within the realm of prints or really art history studies in general. I've been working on it for a little over a year now and I want to get it out for people to play around with and get feedback on. I'm particularly immersed in Japanese woodblock studies so I would love feedback on ways in which I can make this more appealing and understandable to others that are new to this subject.

On the technical side, this project was done using the following technologies: Node, Express, jQuery, Bootstrap (web site), SOLR (text search engine), MatchEngine (image search engine), Node (for scraping), and EC2, S3, and Cloudfront (for hosting site and images). If you have any questions about how the site was implemented - please feel free to ask away!
Role: programmer, designer, researcher
posted by jeresig (16 comments total) 19 users marked this as a favorite
This project was posted to MetaFilter by carsonb on January 7, 2013: Japanese Woodblock Print Database | Ukiyo-e Search

Did a single search, for 'firefly', and wow. This is a lot of fun.
posted by sciencegeek at 1:44 PM on January 6, 2013

Excellent !
posted by Bighappyfunhouse at 2:13 PM on January 6, 2013

Glad you're enjoying this!

@sciencegeek: That's a good search! Other generic searches produce good results as well: monkey (one of my favorites), fish, toad (another fun one), bird, and flower. Birds and flowers have a ton of results and can be easily broken down by type (chrysanthemum, plum, cherry - finch, titmouse, crow, eagle). It's fascinating to see how artists' interpretations of these common symbols differ. Another fun search is to look for common characters, costumes, or dances. For example 'shakkyo' will get you the traditional "lion dance" from Noh and Kabuki, and searching for a god like 'fudo' is another fun one.

I should definitely pull together a master list of a bunch of these - thank you for bringing this up!
posted by jeresig at 4:29 PM on January 6, 2013

Jaw agape. This is going to be a deep rabbit-hole for me. Although I note that some of the alternative versions of a print are in fact completely different prints, eg this.
posted by adamrice at 4:35 PM on January 6, 2013

Put this on my FB page and others have suggested searches:
crow - great minds think alike.
posted by sciencegeek at 4:38 PM on January 6, 2013

oh yeah, umbrella is nice too.
posted by sciencegeek at 4:39 PM on January 6, 2013

@adamrice: Glad you're liking it! So you've actually found an interesting case: These prints are, in fact, visually different - however they have many similar characteristics and, in fact, they are from the same print series (Yoshitoshi's 36 Ghosts series). Note the black crest at the top of the print - this is identical in all the prints (it's actually the title of the print series), also note the similar colored crest in the upper-right-hand-side (this is the title of the print), finally note the color bar at the very bottom of all the prints (the Tokyo Metropolitan Library adds these to the images of their prints, as do other institutions). These three things combined result in the image search identifying these three prints as being visually similar. So it's not perfect but it's still rather good (since these prints are related, being from the same series). I've seen this in other cases as well but it's relatively rare, and it really only seems to happen when there's a color bar in place. I could do more to crop the images but that would be a lot of work. Something to consider for later.

@sciencegeek: Some great ones! Some others: wave, whirlpool, waterfall, mountain, and fuji.
posted by jeresig at 6:04 PM on January 6, 2013

You, sir, are a metadata god. What an amazing job. My first search? Tanuki. Squee!
posted by ReginaHart at 6:29 PM on January 6, 2013

@ReginaHart: Awesome - and thank you! I see a couple more when searching for raccoon as well. If you love Tanuki there's a Japanese database (of monsters and other strange creatures!) that I haven't added in yet, that you might be interested in.
posted by jeresig at 6:39 PM on January 6, 2013

Swoon... I also ran Jizo, oni, and yokai. The link you just posted has some fantastic images as well.

I would be very interested in reading/learning more about the technical aspects of the project. I'm a librarian who really digs metadata, databases, and digital collections.
posted by ReginaHart at 6:58 PM on January 6, 2013

@ReginaHart: Oh, that's awesome - I feel like I'm a little bit of a librarian at heart so it pleases me to hear you say that :)

To explain the architecture of the project a bit:

I've been compiling a list of all known sources of digital Japanese woodblock prints for a couple years now. This is my list of every resource with at least 200 images, online, that I know of. And this is my list of every single resource that I know about, including resources that haven't been digitized yet or that I haven't figured out yet.

The first step in "acquiring" a new resource is writing a basic data collection module. Thus far I've used a single unified model for most of the sites that I've added, which has worked well. (I'll need to develop some new models in the future but I'll cross that bridge when I get to it.) The basic data collection module collects three things: pages of search results, pages detailing a print, and images of prints. I end up with a directory that has a collection of all of these things. A tricky part in here is coming up with unique IDs for each of the pages and images (not all sites are consistent, nor are the URLs always the same, which is frustrating).
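
One way to get repeatable IDs when a site's URLs are inconsistent is to normalize each URL before hashing it. The following is only a minimal sketch, not the scraper's actual scheme: the `source` label, the `example.org` URLs, and the assumption that URLs vary only in host case, fragment, trailing slash, or query-parameter order are all made up for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit


def stable_id(source: str, url: str) -> str:
    """Derive a repeatable ID from a source label and a page or image URL.

    Assumption: two URLs point at the same item if they differ only in
    scheme, host case, fragment, trailing slash, or query-parameter order.
    """
    parts = urlsplit(url.strip())
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 hash identically.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    path = parts.path.rstrip("/") or "/"
    normalized = f"{parts.netloc.lower()}{path}?{query}"
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]
    return f"{source}/{digest}"


# Hypothetical URLs for the same print page, written two different ways.
first = stable_id("museum-a", "http://Example.org/print/?id=5&lang=en#top")
second = stable_id("museum-a", "http://example.org/print?lang=en&id=5")
print(first == second)
```

Sites whose URLs carry session tokens or other volatile parameters would need those stripped before hashing.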

The data collection and extraction is done with a series of scripts and uses XPath expressions to actually get the information out of the page.
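
As an illustration of XPath-based extraction, here is a minimal sketch using Python's standard library. The markup, class names, and field list are invented for the example. It also assumes well-formed XHTML, since `xml.etree.ElementTree` supports only a subset of XPath and cannot parse messy HTML; real museum pages would likely need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical print-detail page; the markup and class names are invented.
PAGE = """
<html><body>
  <div class="print">
    <h1 class="title">Sample print title</h1>
    <span class="artist">歌川広重</span>
    <span class="date">Edo Period</span>
    <img class="full" src="/images/12345.jpg"/>
  </div>
</body></html>
"""

# One XPath expression per text field; each site would supply its own set.
FIELDS = {
    "title": ".//h1[@class='title']",
    "artist": ".//span[@class='artist']",
    "date": ".//span[@class='date']",
}


def extract(page: str) -> dict:
    """Pull the text fields and the full-size image URL out of one page."""
    root = ET.fromstring(page)
    record = {}
    for name, xpath in FIELDS.items():
        node = root.find(xpath)
        has_text = node is not None and node.text
        record[name] = node.text.strip() if has_text else None
    image = root.find(".//img[@class='full']")
    record["image"] = image.get("src") if image is not None else None
    return record


print(extract(PAGE))
```

Keeping the expressions in a per-site table means the extraction loop itself stays the same from one collection to the next.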

Logistically I'm usually analyzing the content structure of a page inside my web browser - I use Chrome and the web developer console and test my XPath expressions in there. Many of the collections that I pull from are in languages that I don't read (such as Japanese, German, or French). Chrome's built-in language translation is an absolutely essential feature and makes it possible for me to understand the structure of the page.

The images are also converted into a variety of sizes at this point - a normal full-size image, a scaled image (where the smallest dimension is no more than 300px), and a thumbnail image - and they're all uploaded to Amazon S3. I should note that I try really hard to get the highest resolution image possible (and I frequently work around built-in protections that museums throw up in order to get to those images -- the images are all technically in the public domain and they deserve to be free for all to use).
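
The "smallest dimension is no more than 300px" rule comes down to a little arithmetic. This sketch only computes the target size; the function name, the choice never to upscale, and the rounding are assumptions, and the actual resizing would be left to whatever image library is in use.

```python
def scaled_size(width: int, height: int, max_short_side: int = 300) -> tuple[int, int]:
    """Return (width, height) scaled so the shorter side is at most max_short_side.

    The aspect ratio is preserved and images that are already small enough
    are returned unchanged (no upscaling).
    """
    short = min(width, height)
    if short <= max_short_side:
        return width, height
    factor = max_short_side / short
    return max(1, round(width * factor)), max(1, round(height * factor))


print(scaled_size(1200, 1800))  # (300, 450)
print(scaled_size(200, 900))    # (200, 900)
```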

At this point I have a collection of files, images, and metadata. I do one more pass on the metadata fixing up common mistakes, formatting issues, and correcting incorrect or uncommon artist names. Additionally I use the opportunity to figure out the canonical artist name. At the moment this is kind of hacky and I hope to make some improvements there.

I've already constructed a list of canonical artists ahead of time. I did this by going through the artists of all the prints contained in the British Museum (they already have the artist names, which is expected, but ALSO have the Japanese kanji characters associated with the artist). This gives me a mapping of English and Japanese names pointing back to a canonical artist.

For example: "Hiroshige, Hiroshige I, Ando Hiroshige, Ichiryusai Hiroshige, Utagawa Hiroshige, Utagawa Hiroshige I, 歌川広重, 広重, 歌川広重〈1〉, 広重〈1〉". All of these map back to a single artist: "Utagawa Hiroshige". This way I'm able to automatically "translate" Japanese artist names into English. I'm actually very proud of this. I've been able to translate a number of Japanese-only collections into English for the first time (such as the collection of the Tokyo National Museum).
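
A mapping like this can be sketched as a lookup table keyed on a normalized form of each alias. The alias list below is the Hiroshige example from above; the normalization rules (Unicode NFKC folding, case folding, macron stripping, whitespace collapsing) are assumptions about what "near-identical" spellings look like, not a description of the site's code.

```python
import unicodedata
from typing import Optional

# Aliases taken from the Hiroshige example; a full table would hold
# one entry per canonical artist.
ALIASES = {
    "Utagawa Hiroshige": [
        "Hiroshige", "Hiroshige I", "Ando Hiroshige", "Ichiryusai Hiroshige",
        "Utagawa Hiroshige", "Utagawa Hiroshige I",
        "歌川広重", "広重", "歌川広重〈1〉", "広重〈1〉",
    ],
}

# Romanized names are often written with or without macrons (Ando / Andō).
MACRONS = str.maketrans({"ā": "a", "ī": "i", "ū": "u", "ē": "e", "ō": "o"})


def normalize(name: str) -> str:
    """Fold width variants, case, macrons, and spacing into one spelling."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(folded.translate(MACRONS).split())


LOOKUP = {
    normalize(alias): canonical
    for canonical, aliases in ALIASES.items()
    for alias in aliases
}


def canonical_artist(raw_name: str) -> Optional[str]:
    """Return the canonical artist for a raw name, or None if unknown."""
    return LOOKUP.get(normalize(raw_name))


print(canonical_artist("歌川広重"))        # Utagawa Hiroshige
print(canonical_artist("Andō Hiroshige"))  # Utagawa Hiroshige
```

An exact-match table like this cannot tell apart two artists who share a name, which is the limitation a date-aware approach would have to address.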

Now that things are correct and translated I insert the data into two data stores: my SOLR text search database (this is what you use when searching for a text string on the site; it supports searching in English and Japanese and also fixes romaji formatting issues) and the MatchEngine image search service. The TinEye folks have graciously provided me with access to the MatchEngine service and let me just say that it is AMAZING. Easily the best commercially available image search service - and it has a very easy-to-use API on top of that, as well.

With all the data in the two data stores the rest is relatively simple (from a web app building perspective). I create a way to upload images for passing them along to MatchEngine and finding similar images, and I have a way for browsing through SOLR search results (which is the majority of the site). Now a lot of work has been done to arrive at this point: tons of data finessing, trying different tools, writing scripts, etc. I've been working on it off-and-on for over a year now and I'm finally feeling good about the progress I've been making.

I hope that helps to clarify some of your questions!
posted by jeresig at 7:40 PM on January 6, 2013 [4 favorites]

Thanks jeresig for explaining the tech stack!
posted by Foci for Analysis at 12:16 PM on January 7, 2013

Wow, thanks so much for offering a peek beneath the hood. I'm even more impressed that you've done it all without being able to read Japanese! (I sure do wish I could.) You've done a marvelous job creating authority records for the artists' names - no small feat. I don't know if these resources would be of any help to you in standardizing/coordinating your metadata, but just in case you haven't run across them, check out the Library of Congress Authority Files for Cataloging Pictures. When you mentioned mapping the artists' names, I immediately thought of the Getty's Union List of Artist Names. Hopefully you'll find a few useful tools or ideas in those links. Happy digital library building!
posted by ReginaHart at 5:16 PM on January 7, 2013

This is really wonderful, jeresig! I was looking for some Meiji era depictions of industrialization, and this is a great way to collect resources from different collections. Resolving all the various names on prints and various catalogues to canonical names seems to be one of the real tricks here. It would be interesting to know why you think this is 'hacky,' and if you also think you can make this smoother in the future. It's also kind of interesting that you had to build the list yourself.
posted by carter at 7:56 PM on January 7, 2013

jeresig: we don't have that many online but everything on http://www.wdl.org/en/search/gallery/?additional_subjects=Woodcuts&country=jp has high quality metadata and the item pages have schema.org microdata which should be reasonable to work with.
posted by adamsc at 7:43 AM on January 8, 2013

@ReginaHart: I was not familiar with those authority files and they're proving to be exceptionally helpful - thank you so much! I'll be digging into them in earnest.

@adamsc: That's fantastic! I love the write-ups of the individual prints and the clear metadata. Thank you!

@carter: Right now the name detection is very basic, just checking to see if the name matches certain words. This works for many cases but fails for the hard cases. A couple examples that are really hard to catch right now:

- Artists frequently changed their names during their career. Right now I have to manually correlate the different artist names together.
- Artists frequently re-used names. For example "Sadanobu" may have been used in the 1700s and then re-used again in the 1800s or 1900s. The only way to fix this is to have date ranges associated with the names and then check the date on the print and correlate that back to a specific artist.
- Sites are frequently very vague about which artist they are referring to. I have a number of sites that just say "Artist: Toyokuni" and it's implied as to which Toyokuni they're referring to. I think this might be a case where the date detection might help but I'm not sure yet.

I think the artist detection is going to have to move from a binary yes/no to a graded system. I'll need to use multiple data points (artist name written in English, artist name written in Japanese, birth date, death date, date of print) and end up with a score, which I could then use to confidently assign them a name.
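
A graded system along these lines might look like the sketch below. Everything in it is hypothetical: the two placeholder artists share a name on purpose, their dates are made up, and the weights and threshold are illustrative rather than tuned against real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artist:
    name: str
    english_aliases: frozenset
    japanese_aliases: frozenset
    born: int
    died: int


# Illustrative weights; a name match alone stays below the threshold.
W_ENGLISH, W_JAPANESE, W_DATE, THRESHOLD = 4, 4, 2, 5


def score(artist, english=None, japanese=None, print_year=None) -> int:
    """Add up the weight of every piece of evidence that fits this artist."""
    total = 0
    if english and english.casefold() in artist.english_aliases:
        total += W_ENGLISH
    if japanese and japanese in artist.japanese_aliases:
        total += W_JAPANESE
    if print_year is not None and artist.born <= print_year <= artist.died:
        total += W_DATE
    return total


def best_match(candidates, **evidence):
    """Return the best-scoring artist name, or None if weak or tied."""
    scored = sorted(((score(a, **evidence), a.name) for a in candidates), reverse=True)
    if not scored or scored[0][0] < THRESHOLD:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # ambiguous: leave for manual review
    return scored[0][1]


# Two made-up artists who used the same name a century apart.
EARLY = Artist("Example I", frozenset({"example"}), frozenset({"例"}), 1700, 1760)
LATE = Artist("Example II", frozenset({"example"}), frozenset({"例"}), 1820, 1890)

print(best_match([EARLY, LATE], english="Example"))                   # None
print(best_match([EARLY, LATE], english="Example", print_year=1850))  # Example II
```

Returning None on a tie keeps ambiguous cases visible instead of silently picking one artist.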

Oh, and dating of prints is a complete mess as well. So many sites just punt on providing accurate dates. For example the Tokyo National Museum just says something like "Edo Period" (260 years!) or "18th Century" (when the artist only ever lived in the 18th century). Once I have better dating on the artists I can then start to narrow down the date ranges of the prints as well.
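
Narrowing a vague period label with an artist's dates is essentially an interval intersection. A minimal sketch, assuming a small hand-written table of period labels and hypothetical artist years:

```python
from typing import Optional, Tuple

# Year ranges for a couple of vague labels; a real table would be longer.
PERIODS = {
    "edo period": (1603, 1868),
    "18th century": (1701, 1800),
}


def narrow(label: str, artist_years: Tuple[int, int]) -> Optional[Tuple[int, int]]:
    """Intersect a period label with the years an artist could have worked.

    Returns the overlapping (start, end) range, or None if the label is
    unknown or the two ranges do not overlap.
    """
    period = PERIODS.get(label.strip().casefold())
    if period is None:
        return None
    start = max(period[0], artist_years[0])
    end = min(period[1], artist_years[1])
    return (start, end) if start <= end else None


# Hypothetical artist years, used only to show the narrowing.
print(narrow("Edo Period", (1820, 1858)))    # (1820, 1858)
print(narrow("18th Century", (1790, 1830)))  # (1790, 1800)
```

An empty intersection is worth flagging on its own, since it suggests either the site's date or the artist attribution is wrong.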
posted by jeresig at 9:55 AM on January 8, 2013 [1 favorite]
