OpenHackNYC gave me an excuse to start playing with the Yahoo Query Language. With YQL, you build a binding to a data source, an XML table or other web service, and use an “expressive SQL-like language” to manipulate the data. Instant database functionality with minimal overhead.
The Heilbrunn Timeline of Art History has these great individual timelines of key art/world events. The timelines are classified by time period and by geography, with a free-text “encompasses” string, e.g. Eastern Europe and Scandinavia, 1600–1800 A.D. encompasses “Belarus, Denmark, Estonia, Finland, Iceland, Latvia, Lithuania, Norway, western Russia, Sweden, and Ukraine.”
I’d like to point to all of the related timelines for an arbitrary work of art – provided it has an associated geographical term and date range. I scraped all of the timeline links, titles, dates, and encompasses strings, formatted all the data in an XML table, made a simple binding, and rigged a pipe front-end.
USE "https://netfiles.uiuc.edu/pdadamcz/www/museumpipes/yql/Timelines.xml" AS Timelines; SELECT * FROM Timelines
Try it in the YQL console.
YQL / Timelines Pipe
USE "https://netfiles.uiuc.edu/pdadamcz/www/museumpipes/yql/Timelines.xml" AS Timelines; SELECT title, link FROM Timelines where encompasses like "%Poland%" and datebegin > 1000 and dateend < 2000
Try it in the YQL console.
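The same filter can be sketched outside the YQL console. This is a rough Python sketch, assuming the table rows have already been loaded as dictionaries keyed by the field names used in the query (title, link, encompasses, datebegin, dateend). The function name, the placeholder link, and the case-insensitive matching are my assumptions, not part of the original binding; the one sample row reuses the Eastern Europe and Scandinavia example quoted earlier.

```python
# One sample row built from the example quoted in the post.
# The link is a placeholder, not the real timeline URL.
SAMPLE_ROWS = [
    {
        "title": "Eastern Europe and Scandinavia, 1600–1800 A.D.",
        "link": "https://example.org/timeline",
        "encompasses": (
            "Belarus, Denmark, Estonia, Finland, Iceland, Latvia, Lithuania, "
            "Norway, western Russia, Sweden, and Ukraine"
        ),
        "datebegin": 1600,
        "dateend": 1800,
    }
]


def related_timelines(rows, place, after, before):
    """Return (title, link) pairs for timelines that mention `place`
    and fall strictly inside the (after, before) year range."""
    matches = []
    for row in rows:
        # Rough equivalent of: encompasses LIKE "%place%"
        # (case-insensitive here, which is an assumption)
        mentions_place = place.lower() in row["encompasses"].lower()
        # Equivalent of: datebegin > after AND dateend < before
        in_range = row["datebegin"] > after and row["dateend"] < before
        if mentions_place and in_range:
            matches.append((row["title"], row["link"]))
    return matches
```

Calling related_timelines(SAMPLE_ROWS, "Ukraine", 1000, 2000) returns the sample timeline, while "Poland" returns nothing for this row, which is exactly the vocabulary gap the TODO below worries about.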
//TODO: The entire vocabulary of the encompasses strings isn’t exhaustive – understandably. But maybe I can pass the query from the pipe through a geo service to find neighbor terms and send them all through the YQL statement?
Another feature that could be helpful is a display of relative size. ArtsConnectEd has a solid implementation (for example – scroll down to the details, and click the Scale tab). The Met provides dimensions (in a number of formats, grumble) but does not present any cues as to the relative size of the work. So when comparing works with similar aspect ratios like Kensett’s Lake George and Homer’s Prisoners from the Front, it would be easy to assume that the works are similar in size.
Parsing the tombstone gave me width and height in centimeters. Note, I’m only handling dimensions formatted as (ABC x XYZ cm) like those in American Paintings and Modern. WolframAlpha gave me an average human height, 162 cm. And AIGA has all of the classic Symbol Signs available… I set 1 pixel = 1 cm so things wouldn’t get too large. The only sneaky bit was using Google Charts to draw the scaled rectangle for the work of art, just a bar chart with width and height set to the dimensions of the work.
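The parsing and scaling steps can be sketched roughly as follows. This is a Python sketch under a few assumptions: the centimeter dimensions appear in parentheses as two numbers separated by "x", the first number is the height and the second the width (the usual museum convention, but not something I verified against every record), and the function names and the sample tombstone string in the usage note are made up for illustration. The 162 cm figure and the 1 pixel = 1 cm scale come from the text above.

```python
import re

HUMAN_HEIGHT_CM = 162  # average human height quoted above
PIXELS_PER_CM = 1      # 1 pixel = 1 cm, as in the post

# Matches "(76.2 x 101.6 cm)"; other dimension formats are not handled.
_DIMENSIONS = re.compile(
    r"\(\s*(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*cm\s*\)"
)


def parse_dimensions_cm(tombstone):
    """Return (height_cm, width_cm), or None if the format is not recognized.

    Assumes the first number is the height and the second the width.
    """
    match = _DIMENSIONS.search(tombstone)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def scaled_size_px(height_cm, width_cm):
    """Return (width_px, height_px) for the rectangle standing in for the work."""
    return round(width_cm * PIXELS_PER_CM), round(height_cm * PIXELS_PER_CM)
```

For a made-up tombstone such as "30 x 40 in. (76.2 x 101.6 cm)", parse_dimensions_cm gives (76.2, 101.6) and scaled_size_px gives a 102 x 76 pixel rectangle, to be drawn next to a human figure HUMAN_HEIGHT_CM pixels tall.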
//TODO: Extend to other dimension formats. Handle 3D works. And for smaller works, coffee or martini?
The V&A just released a great beta of their collection search. I really like the jQuery Wall they’ve provided for browsing the collection – it’s an (apparently) infinite canvas of objects from the V&A collection with a movable viewport. The SFMOMA ArtScope does something similar, but I imagine the Wall could be a bit more flexible and maybe a bit quicker loading once some of the interaction lag gets ironed out (maybe some added visual interaction cues as well?). Looking at these reminded me of two tools that help in making tilesets of large images for use with Google Maps.
I collected the enlarged images of the 100 Modern department highlights (these are width limited to 500px with a variable height), made a 10 x 10 contact sheet in Photoshop, and had the Image Cutter break down the resulting 5,000 x 8,000 pixel image. Allowing for zooming down to level 7 on the Google Map took 5,461 256 x 256 pixel tiles.
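The tile count is consistent with a full quadtree pyramid. A quick Python check, assuming each zoom level doubles the grid in both directions (so level z holds 4^z tiles) and that "level 7" means seven levels numbered 0 through 6; the function name is mine, and the Image Cutter may well number its levels differently.

```python
def tiles_in_pyramid(levels):
    """Total tiles in a full pyramid where level z is a 2^z x 2^z grid."""
    return sum(4 ** z for z in range(levels))


# Seven levels: 1 + 4 + 16 + 64 + 256 + 1024 + 4096
total = tiles_in_pyramid(7)
```

Each extra level roughly quadruples the total, so the deepest zoom level dominates the storage cost.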
I combined the automatically generated Image Cutter Google Maps code with a few functions to load markers from an XML file. I collected the URL for each highlight with a quick processing sketch. It took some trial and error to figure out how to place the first few markers but the uniform spacing helped once reference points were set.
This simple Google Maps solution isn’t nearly as dynamic as either the V&A Wall or the SFMOMA ArtScope, but I like how it has the potential to move quickly between giving a broad overview of the collection, to showing details-on-demand, and then providing multiple detail or high resolution images all in a single interface.
//TODO: Polygon overlays could be used instead of markers to make an entire collection object clickable. More information could be added to the tooltips – I was having trouble with special characters in XML CDATA.
Since putting together the New York Times identity network, I’ve wanted to look more closely at a larger network of art identities and subjects. I reworked some of the OCLC pipes that pull related identities and associated subjects from an Identity page to output something a bit closer to a TouchGraph data file, wrapped the whole business in a processing sketch, and had it crawl 100 objects from the Met’s Modern Art department.
Modern Art / OCLC Network TouchGraph (detail)
After some data cleaning, the network contains,
~3,500 nodes = 1,200 related identities + 2,200 associated subjects + 100 Modern Art Records and
~7,200 edges = 1,700 -> related identities + 5,500 -> associated subjects
The same terms can appear as related identities and associated subjects. As in the image above, Jasper Johns the associated subject is selected, while the identity is in the upper left. I’ve color coded the nodes in the graph (blue identities, gray subjects) and they are distinct in the data.
For a network this large, TouchGraph works well a single node at a time, but extending the locality stressed out my machine and I still wanted to see the whole network. Pajek to the rescue. Below is the 3D force-based layout of only the identities.
Modern Art / OCLC Network - Identities Only
Better, but I still wanted a closer look. Pajek exports to X3D.
Used Octaga to render; took a quick screencast…
Directionality is missing from the images but the edges only go from the numbered nodes (the starting set of Met Modern Art records) to Identity records from OCLC.
The major nodes are those you might expect. Each associated subject is presented in a tag cloud on the Identity page with a variable font size. I’ve used those sizes as edge weights where appropriate and summed them across the network here.
Associated Subjects | Sum of Weights
Exhibition catalogs | 412
Criticism, interpretation, etc. | 377
Catalogs | 326
Biography | 318
Art | 298
United States | 295
Artists | 246
History | 244
Painters | 236
Art, Modern | 165
Related Identities | Occurrences
Museum of Modern Art (New York, N.Y.) | 24
De Kooning, Willem 1904-1997 | 11
Picasso, Pablo 1881-1973 | 10
Pollock, Jackson 1912-1956 | 10
Rothko, Mark 1903-1970 | 10
Marin, John 1870-1953 | 9
Matisse, Henri 1869-1954 | 9
Weber, Max 1881-1961 | 9
Braque, Georges 1882-1963 | 8
Stieglitz, Alfred 1864-1946 | 8
(Ahem, Metropolitan Museum of Art appears only 3 times in the network.)
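The two tallies above (summed tag-cloud weights for subjects, plain occurrence counts for identities) amount to a pair of counters over the edge list. A minimal Python sketch, assuming each edge is a tuple of (record id, kind, label, weight); that tuple layout and the function name are illustrative and not the actual TouchGraph or Pajek file format.

```python
from collections import Counter


def summarize(edges):
    """Tally the edge list two ways.

    edges: iterable of (record_id, kind, label, weight), where kind is
    "subject" or "identity". This layout is a hypothetical stand-in for
    the real data file.
    """
    subject_weights = Counter()   # sum of tag-cloud weights per subject
    identity_counts = Counter()   # number of edges pointing at each identity
    for record_id, kind, label, weight in edges:
        if kind == "subject":
            subject_weights[label] += weight
        elif kind == "identity":
            identity_counts[label] += 1
    return subject_weights, identity_counts
```

Calling most_common(10) on either counter gives a ranked top-ten list of the kind shown above.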
I’ve started looking at the network metrics in UCINET and Pajek but I think there has to be something said about validity at this point. What we have is a two-mode network, i.e. a bipartite data set. Not a problem; plenty of ways to look at the data. But this is more an artifact of the data collection method than reality. Object records don’t point to one another and, since I didn’t iterate, there are no connections between the collected nodes. Of course the validity of the whole data set is pretty dubious. My selection criteria were intentionally broad and uninformed – picking the top 3 identities from OCLC and then pulling in everything, ignoring rank and weight in the data collection phase. The initial goal of the pipes was to find out more about the quality of the results from OCLC – to see if a simple query would suffice. So the pipe structure will need to change if we want validity. I don’t know nearly enough about how associated subjects are mapped to identities, or how an identity is “related” to any others, or for that matter how complete the coverage is for Modern Art in OCLC. Ultimately, any analysis will be saying more about the OCLC data surrounding books than about the Met’s holdings. I’ll be sure to present the network analysis metrics on a more “complete” dataset.
With all of that criticism about lack of rigor out of the way: Wow. With a large enough starting set, the resulting network gets rid of the noise pretty well. I think this network is a good place to start, with a clear direction for improvement.
I received a special request for an Internet Archive pipe. Starting from the advanced search page there was plenty to work with. The Advanced XML Search form returns whichever record fields you might want as XML, JSON, CSV, or an HTML table. The form exposes all of the passed parameters in the search response URL, making it straightforward to rig a pipe to create a well formatted query.
Internet Archive Search Pipe
Some convenient details in the pipe. Adding long strings in the Pipes interface is annoying due to the short textbox lengths so having each of the record fields added to an array, fl[], makes it easy to see all the parameters. And the Pipes team have added a Create RSS module which makes converting the returned fields to an RSS feed much cleaner.
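Outside of Pipes, the same URL construction can be sketched in a few lines. This is a Python sketch that only builds the URL and does not fetch it. The endpoint and the parameter names (q, fl[], rows, output) are my reading of what the Advanced Search form puts in its response URL, so check them against the form before relying on them; the function name and defaults are illustrative.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the Advanced Search response URL (assumption).
BASE_URL = "https://archive.org/advancedsearch.php"


def build_search_url(query, fields, rows=50, output="json"):
    """Build a search URL with one repeated fl[] parameter per record field."""
    params = [("q", query)]
    # Each requested field becomes its own fl[] entry, mirroring the array in the pipe.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("output", output)]
    return BASE_URL + "?" + urlencode(params)
```

Keeping the fields in a list plays the same role as the fl[] array in the pipe: the full set of requested fields is visible in one place.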
One quibble with the data format from the Internet Archive: the record fields are returned in XML as repeated elements, which makes it just a little harder to manipulate. The JSON response is great, with every field placed in a distinct element.
I tried tuning the quality of results. By default the search string John Singer Sargent gets translated into this baroque query:
(title:john^100 OR description:john^15 OR collection:john^10 OR language:john^10 OR text:john^1) (title:singer^100 OR description:singer^15 OR collection:singer^10 OR language:singer^10 OR text:singer^1) (title:sargent^100 OR description:sargent^15 OR collection:sargent^10 OR language:sargent^10 OR text:sargent^1)
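The pattern in that expansion is regular enough to reproduce. A short Python sketch, assuming every whitespace-separated term is lowercased and searched across the same five fields with the boosts visible above; this mimics the observed output only and is not the Internet Archive's actual implementation. The names FIELD_BOOSTS and expand_query are mine.

```python
# Field boosts copied from the expanded query shown above.
FIELD_BOOSTS = [
    ("title", 100),
    ("description", 15),
    ("collection", 10),
    ("language", 10),
    ("text", 1),
]


def expand_query(search_string):
    """Expand a plain search string into one boosted OR-clause per term."""
    clauses = []
    for term in search_string.lower().split():
        parts = [f"{field}:{term}^{boost}" for field, boost in FIELD_BOOSTS]
        clauses.append("(" + " OR ".join(parts) + ")")
    return " ".join(clauses)
```

Changing the numbers in FIELD_BOOSTS, or dropping a field such as language, is one place to start when tuning the quality of results.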
Parallel Sets is a tool and visualization method for exploring categorical data. Multidimensional data is going to be hard to present without significant design work and hard to interpret for most information seekers. There is a learning curve with these graphs, but once you get used to them they really are very rich and easy to query.
European Paintings: Medium, On View, Component
European Paintings: Medium, Not on View / On View, Component
American Paintings and Sculpture: Medium, On View, DateEnd
American Paintings and Sculpture: Medium, On View / Not On View, DateEnd
American Paintings and Sculpture: DateEnd, On View, Medium
I think the datasets may have stretched the tool a bit. Labels and scaling got a little wonky, but once the data was filtered to a more reasonable set of values along a dimension, brilliant. The order in which the variables are added to the visualization can change the presentation dramatically which really helps in answering different sets of questions.
This could be getting closer to a chart that would be useful for at-a-glance comparisons across collections.
I’ve been surprised by how many hierarchies can be extracted from aggregated museum object data. I’ve always liked the look of John Stasko’s SunBursts. Like the treemaps, another space-filling hierarchical display, but radial and a bit mesmerizing when the interaction is done just right. I repurposed some of the arc diagram code and made a quick SunBurst processing sketch. I mashed up / pared down some other visualizations into another really basic sketch. I’d like to think I can combine a few more visualizations in interesting ways (e.g. Bloom Diagram) – still just sketches rather than full applications; not done exploring.
All of the visualizations, the more practical and the speculative, are meant to augment collections navigation and search in some way. More for browsing than directed search tasks, but maybe a bit of both… Reading Marti Hearst’s great survey, Search User Interfaces (ch. 10 in particular).
//TODO: Refine one of the sketches to include some basic interface widgets. And some event triggers – though the visualizations are starting to look OK, interaction is going to take a while to get right.
Some of these are crude, and maybe come off a bit clumsy – but early days yet. I’m still getting a handle on what content I can really use, and still have a load of questions: How precise is the geography data? How reliable are the dates? Are there any meaningful connections between object records already noted in the metadata?
Where were all of the European Sculpture and Decorative Arts works made?
When were the works in American Paintings and Sculpture made?
What media are represented in American Paintings and Sculpture?
Do we have anything made of “shell” in American Paintings and Sculpture?
What kinds of work do we have from 1872 in American Paintings and Sculpture?
When were the bronzes in American Paintings and Sculpture made?
How old are the books in our library and did we focus our collecting on different topics over time?
I’ll post links to working versions in subsequent posts once I start refining a bit more. I’m having trouble with permissions in Google Spreadsheets. Even with the data shared with everyone to read and the sheets published, I’m still getting a “user not signed in” error. Any ideas?
//TODO: Move the data into Processing. ManyEyes and Google Vis don’t seem to have good hyperlink capabilities – so no easy way to connect individual object records with elements in the visualization. Look for a simile timeline + map mashup. Maybe move to Prefuse? Flare?
The Google motion charts were a breeze. Post the data to a spreadsheet, make sure the data is formatted correctly, and set the spreadsheet as the datasource using the Google visualization code – all there is to it. As always the code and data are open – though most of the code just came from Google’s interactive examples. Motion helps tremendously in observing trends.
Google Motion Chart
My source in the library passed along some circulation statistics which I added to the spreadsheet powering the ManyEyes visualizations. Brilliant stuff. ManyEyes has a treemap comparison option that shows the relative difference between two variables in the hierarchy. Below is a comparison of the Item Count and Circulation Count – the lighter the region, the higher the percentage of books in that category were checked out over the past year, i.e. lightness equates to percent utilization.
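The quantity behind the lightness is a simple ratio. A tiny Python sketch, assuming percent utilization means circulation count divided by item count for each category; how ManyEyes actually maps that ratio to a shade is not something I know, and the function name is mine.

```python
def utilization(item_count, circulation_count):
    """Percent of a category's items that circulated; 0.0 for empty categories."""
    if item_count <= 0:
        return 0.0
    return 100.0 * circulation_count / item_count
```

If the circulation figure counts checkouts rather than distinct books, the result can exceed 100 for heavily used categories.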
Why all this work on a dashboard, charts, graphs, and so on? The pipes work really started out of a frustration that collections information lived in a vacuum – without any sense of the interconnections between any two object records let alone any material outside the collection. But with internal context now becoming increasingly available and abundant external connections (with relevance still an issue but not a roadblock), the problem is turning into an intriguing special case of a more general problem with a well defined approach from an information visualization perspective: overview first, zoom and filter, then details-on-demand. That mantra is probably best expressed in Ben Shneiderman’s work and many of the visualization projects at the HCIL.
What doesn’t seem to fit is the austerity of the visualizations. Well, that, and the lack of concrete questions that need answering without overspecialized visualizations. The overview, zoom, filter, details approach gives incredibly good exploratory tools but they move too far away from the spirit of the data for my taste. At the other end are the more designer-ly approaches that seem to focus on their own aesthetics rather than reveal more about the data. I don’t want this to sound like the usual data vs. design debate. I think Museum data has the enviable trait of not needing too much abstraction to become compelling – quite the opposite in fact. Maybe it’s the details-on-demand part that’s the problem? Maybe the details about a given data point just need to bubble up sooner in a Museum context to make for a compelling experience.
I’ve started with library data, I’ll try to move on to aggregate collections information next.
//TODO: Pass along some of the data to the JavaScript InfoVis Toolkit. It has some great demos and the coding looks lightweight; the treemaps aren’t very pretty though. I remember the treemap code in Ben Fry’s Visualizing Data was easy to work with.
Dashboards are all the rage. I’ve been able to get some museum library information to power new charts and start working out what kinds of questions we might be able to answer. What works are best represented in the libraries? What kinds of connections are there between special exhibitions and library research traffic? …
Watson Library Dashboard (detail)
Watson Offsite Storage - World Map
I’m using Many Eyes as a prototyping environment, but will be replicating some of the work in the Google Visualization API. It has a few more chart types and seems to be a bit more stable. All of the data is being stored in Google Spreadsheets.
The treemaps are really coming out great – the one above breaks down our offsite storage by LoC call number. The Library of Congress classification scheme gets to a fairly low level of granularity, and though it doesn’t quite match up with departments it still gives a good (partial) view of our collection development policies.
//TODO: The Google visualization API seems straightforward – make a motion chart of all of the related time series. Pulling public access catalog information out of Innovative seems to be a real pain. I’d really like to match library search queries against sitewide searches. I’m avoiding dusting off any Time Series Analysis textbooks – though this one wasn’t that scary.