
Digital Humanities 2013: session notes

[Image: DH word cloud, made with Cirrus via Voyant Tools]

As I often do at library conferences, I’m publishing my lightly edited session notes from Digital Humanities 2013, held in Lincoln, Nebraska. First and foremost, I want to stress what a rewarding and worthwhile experience coming to DH has been. As a librarian engaged closely in efforts to support and enable digital scholarship, it is critical for me to hear firsthand about the work being done and to tap into the many issues surrounding DH in its current form and state.

Please note that these are just notes, so do comment with any corrections, rebuttals, improvements, or questions. In general, it should be clear where I’m taking notes and where I’m editorializing, but where it’s not, I’ve put my comments in italics. Also, even though most of the talks list multiple authors (findable via the links), since I’ve used third person singular references for the most part in my notes, only the presenter is listed below.

One last note: I took and prepared these notes using Draft, a super lightweight and sensible online editor that also publishes directly to WordPress and other outlets. Highly recommended.

Wednesday

Uncovering the “hidden histories” of computing in the Humanities: findings and reflections of the pilot project
Julianne Nyhan, University College London

Posed a battery of questions at the outset, including the simple one concerning why everyone works in centres. Others concerned knowledge transfer, access to resources, judgment, etc. The key point is that there is no real history of computing in the humanities, so it’s time to explore this. Their method is oral history, focusing on individuals, not projects, specifically people working in the field between 1949 and 1989. They pose questions about their work and the use of computers.

She resists drawing out common themes from the oral histories, choosing instead to focus on individual experience. That said, there are a few themes, among them the notion of training and where they got it. Another theme she hit on was the connection between DH and revolution, and how often the label is applied to DH. Her project to some degree is about pointing out that DH builds on other work, from within the humanities and without, so it’s not really a revolution.

She showed some telling quotes from various practitioners. Clearly, there was a lack of respect within the humanities for what they were doing, and many saw computers as an intrusion or an attack, or at best as a fad.

Are Google’s linguistic prosthesis biased towards commercially more interesting expressions? A preliminary study on the linguistic effects of autocompletion algorithms
Anna Jobin, EPFL, Switzerland

Caught the end of this Google auto-completion talk and wish I had heard the whole thing. What I saw was really interesting.

Digitizing Serialized Fiction
Kirk Hess, University of Illinois at Urbana-Champaign

The fiction in question was stories published in farm newspapers in the early 20th century. This involves newspaper digitization, which is mainly done from microfilm and requires segmentation that is labour intensive and costly.

Current newspaper archives do not enable finding serialized fiction. No metadata, poor OCR (means titles can have typos), lack of linking between issues. Also, pagination with newspapers is random, i.e.- an article can continue on any page. OCR is one of their major challenges: lots of errors, not a lot of people. Built the site in Omeka with the Scripto plugin. They also considered Islandora, but it was more complex.

They applied what he called “lite TEI” to the corrected text. At the end of the project, he asked himself how one could do it automatically, rather than finding it manually and compiling it in spreadsheets. Perhaps possible by looking for certain n-grams: chapter, to be continued, the end. Also pointed to another series of possibilities, such as network analysis, named entity extraction, etc.
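
Just to make that idea concrete for myself, here’s a minimal sketch of marker-based detection; the phrase list and the OCR file name are my own assumptions, not his actual workflow:

```python
import re

# Phrase markers he suggested as likely signals of serialized fiction.
MARKERS = ["chapter", "to be continued", "the end"]

def find_serial_markers(text, window=60):
    """Return (marker, snippet) pairs for each marker found in OCR'd page text."""
    hits = []
    lowered = text.lower()
    for marker in MARKERS:
        for match in re.finditer(re.escape(marker), lowered):
            start = max(0, match.start() - window)
            snippet = text[start:match.end() + window].replace("\n", " ").strip()
            hits.append((marker, snippet))
    return hits

# Hypothetical usage: scan a single OCR'd newspaper page.
with open("page_ocr.txt", encoding="utf-8") as f:
    for marker, snippet in find_serial_markers(f.read()):
        print(f"{marker!r}: ...{snippet}...")
```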

The German Language of the Year 1933. Building a Diachronic Text Corpus for Historical German Language Studies.
Hanno Biber, Evelyn Breiteneder; Austrian Academy of Sciences

Austrian Academy Corpus is a massive text corpus that covers 1848-1989. Contains various historical and cultural texts and can be used as a basis for various projects. Wide array of text types present.

Watching his talk, it occurred to me that this kind of corpus analysis is, on one level, an effective response to revisionists who want to paint Austria’s history with a prettier brush. One can, by showing small linguistic shifts, follow trends and ideas as they emerge, even from texts where one would not typically go for this kind of information. In other words, this kind of analysis can only succeed using the tools and methods of digital scholarship.

Finished by showing Kraus’s “Dritte Walpurgisnacht.” Penned in 1933, it closes with a quote from Goethe about killing tyrants. Alas, we didn’t get to hear more of this aspect since he ran out of time. Will publish their results next year.

Digging Into Human Rights Violations: phrase mining and trigram visualization
Ben Miller, Ayush Shrestha; Georgia State University

Their work relates to human rights. Mentioned the example of the Bosnian book of the dead, where quantitative work was done to deduce the list of victims. He also believes that it’s possible to do qualitative work that gives testimony to their experiences.

To do their work, they used a collection of first responder accounts from the World Trade Center. They began looking at ways to identify the appearances of individuals in such accounts, since sometimes people are mentioned by name and other times not. They’re using, as he put it, fairly standard natural language processing techniques. Since the end result of this work is to provide accounts to tribunals, their work has a higher standard applied to it.
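
To illustrate the kind of off-the-shelf NLP he was gesturing at, here’s a tiny sketch using spaCy’s stock named entity recognizer; spaCy and its small English model are my assumption, not necessarily their toolchain:

```python
from collections import Counter
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def person_mentions(account_text):
    """Count explicit person-name mentions in one account."""
    doc = nlp(account_text)
    return Counter(ent.text for ent in doc.ents if ent.label_ == "PERSON")

sample = "Captain Reyes told me to head north. Reyes and another officer stayed behind."
print(person_mentions(sample))
```

Of course, this only catches named mentions; the unnamed ones (“another officer”) are the harder half of their problem, and presumably where the higher evidentiary standard really bites.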

Some of their data comes in structured form, with dates, locations, etc., while other sources are unstructured, such as narratives. The former can be used nearly as is, while the latter requires work to make it a viable source. They map events to location using something known as Storygraph, which tracks the movements of characters through events, as well as time spent. They also use Storygrams, where it’s less certain who was where and when. But the graph shows a relative sense of proximity and time. Showed screen shots of Storygraphs. Very compelling.

Automatic annotation of linguistic 2D and Kinect recordings with the Media Query Language for Elan
Anna Lenkiewicz, Max Planck Institute for Psycholinguistics

Generally speaking this was about adding annotations to linguistic gestures, i.e.- gestures that accompany speech (gesture studies). They use a Kinect for this work since it can create a digital skeleton of the subject. The gestures are then encoded using output from the Kinect (I believe), since it can recognize speed and direction.

VizOR: Visualizing Only Revolutions, Visualizing Textual Analysis
Lindsay Thomas, Dana Solomon; UC Santa Barbara

Only Revolutions is a novel (by Mark Z. Danielewski). Didn’t know that from the talk title, but knowing it makes the title much clearer. Beyond visualizing the novel, they are interested in exploring visualization as a scholarly practice.

The novel is unusual in form, broken into various quadrants and textual types. Structure and pattern are critical elements. Reminds me of Arno Schmidt on a good day.

They had to pay attention to variant spellings and case patterns in their database (e.g.- Sep and Sept for September, or creep and CREEP).
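
For my own benefit, a toy sketch of that normalization step; only the Sep/Sept example from the talk is encoded, and how to treat case variants like creep/CREEP is left as an explicit decision rather than a blanket lowercasing:

```python
# Map surface variants to a canonical form before counting or querying.
# Case may carry meaning in the novel, so nothing is collapsed unless
# it is explicitly listed here.
VARIANTS = {
    "Sep": "September",
    "Sept": "September",
}

def canonical(token):
    """Return the canonical database form of a token."""
    return VARIANTS.get(token, token)

for t in ["Sep", "Sept", "creep", "CREEP"]:
    print(t, "->", canonical(t))
```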

Question: what do scholars gain from such visualizations? He had a compelling answer, couched in terms of being attentive to the text and its structure and instructions. Would actually enjoy reading his answer, which was quite detailed and made multiple references to other sources.

Ambiances: A Framework to Write and Visualize Poetry
Luis Meneses, Texas A&M University

Interesting: the second talk I’ve seen today that employs the Kinect device. Perhaps I missed something, but the demo he showed us struck me as the gamification of poetry. That’s a nifty idea in a world where it’s hard to engage students in such creative expressions. Then again, maybe I missed something.

As he put it, they’re trying to support new creative methods. So maybe I did get the point.

XML-Print: Addressing Challenges for Scholarly Typesetting
Marc Wilhelm Küster, Worms University of Applied Sciences, Germany

This talk could have been called “yet another awesome substantive project funded by the DFG.” This has been a constant theme at conferences in recent years (CNI, DLF, DH, etc.). Germany is really doing great things for DH/DS.

XML-Print is what it sounds like, namely, a tool to take XML input and typeset it for publication. Typically the input is TEI-encoded, but not always. Did a nice job narrating a video of a demonstration of the tool. It’s available at www.xml-print.eu and via Sourceforge.

Bindings of Uncertainty. Visualizing Uncertain and Imprecise Data in Automatically Generated Bookbinding Structure Diagrams
Alberto Campagnolo, University of the Arts, London

So his topic really is historical bookbinding forms. His objective is to develop an XML schema to allow description of bookbinding forms and then generate visualizations. He observed uncertainties in his data, and he was unsure of how to visualize these uncertainties. That’s the main problem.

Possibilities of narrative visualization: Case studies of lesson-learned-oriented archiving for natural disaster
Akinobu Nameda, Ritsumeikan University, Japan

Took as his model the Great East Japan Earthquake of 2011. Pointed out that the record of such events can help us prepare for future events. They are attempting to find narratives within the mass of textual information generated in the wake of the earthquake.

Visualizing Uncertainty: How to Use the Fuzzy Data of 550 Medieval Texts?
Stefan Jänicke, University of Leipzig; David Joseph Wrisley, American University in Beirut

Using the geospatial-temporal data visualization tool GeoTemCo to work with place names from medieval French texts. Place names can be a bit vague, to say the least. GeoTemCo’s particular asset is the ability to do comparative visualization with multiple datasets. Also have used ThemeRiver, but he pointed out that for literary data one has to repurpose it.

Wrisley made the general point that the specificity of GIS doesn’t correspond well to the uncertainty inherent in the humanities, where there are hypotheses that have been reached by consensus, but are not strictly objective.

Thursday

Documentary Social Networks: Collective Biographies of Women
Alison Booth, University of Virginia

Her work investigates collective biographies, i.e.- tracks and documents biographies of women over time to discover networks. These networks are “involuntary” to use her word, and her work uncovers the relative degree of separation between women who are otherwise not brought into connection, e.g.- Sister Dora and Lola Montez. She refers to women connected in this fashion as “documentary cousins.”

Databases in Context: Transnational Compilations and Networks of Women Writers from the Middle Ages to the Present
Hilde Hoogenboom, Arizona State University

Women writers, particularly those of sentimental novels in the 18th and 19th centuries, were the most frequently translated authors of the time. So while they are typically excluded from national canons, transnationally they are more present in the form of compilations. Such compilations extend back to the 14th century, and continue forward in history. Her work illustrates how popular specific women were in translation in other countries, e.g.- George Sand in Russia.

Interesting sidenote: she’s not the first speaker I’ve heard mention where the work being presented was published, and as one would expect, it’s in a journal with a paywall. There is some irony in this. Digital humanities is about the new and bold, but they still pass their research output through the machinery of the publishing world. It would be interesting to investigate how many of these articles have been deposited in open access repositories (IRs and the like). Also, how can the static, textual form capture many of the expressions of the digital humanities, e.g.- interactive network maps, visualizations, etc.? A journal article seems like a stunted and impractical outlet for such work. [In Friday's talks, I heard a number of substantive answers to such questions and concerns.]

On Our Own Authority: Crafting Personographic Records for Canadian Gay and Lesbian Liberation Activists
Constance Crompton, University of British Columbia, Okanagan Campus

Their original intent was to create a digital version of a print book (Lesbian and Gay Liberation in Canada by Donald W. McLeod). They’ve gone beyond that, it seems. One of the issues they ran into in their work that needed resolution was the category of sex. Markup requires specificity that doesn’t fit reality. Moving past her own work, she feels that a form of scholarly activism would be to encourage various authority record bodies to consider how they classify sex and modify their practices to reflect the ambiguity of reality. There is, for example, active cross-reference between sources such as VIAF and Wikipedia.

Research to Clarify the Interrelationships Between Family Members Through the Analysis of Family Photographs
Norio Togiya, Tokyo Institute of Technology

Involved creating network graphs from analysis of family photographs. This involves the creation of annotations with names inserted from authority files. The number of co-appearances in photographs is quantified, and with that one can build the network map. From these network maps they infer information about the changes in family relationships in one family over five generations. Their point is that the relationship between family members is not always clear from written documents, but the patterns that emerge in photographs provide critical information that allows one to make fairly valid assumptions about the psychological nature of the relationships.
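
Here’s how I imagine the co-appearance counting working, as a rough sketch with networkx; the annotation format and person identifiers are hypothetical:

```python
from itertools import combinations
import networkx as nx

# Hypothetical photo annotations: each entry lists the authority-file
# identifiers of the people recognized in one photograph.
photos = [
    ["person_01", "person_02", "person_03"],
    ["person_01", "person_02"],
    ["person_02", "person_03"],
]

G = nx.Graph()
for people in photos:
    # Every pair appearing in the same photo gets its edge weight incremented.
    for a, b in combinations(sorted(set(people)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

for a, b, data in G.edges(data=True):
    print(f"{a} -- {b}: co-appeared {data['weight']} times")
```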

Prosopography in the Time of Open Data: Towards an Ontology for Historical Persons
John Bradley, King’s College London

Essentially highlighted the notion that prosopographical studies and work are in a way linked data projects. What he suggests is that we need to move from closed to open prosopography, in other words, move beyond closed research teams working on contained domains toward greater collaboration and what he called “fuzzy” boundaries. As a librarian, this is music to my ears, but I also wonder how one realizes this since it requires a high degree of adherence to standards. He noted this, and said it’s a pity because they all have the same semantic goals in sight, but just go about it differently. What he’s proposing is that we need an ontology for historical persons, which would act as a bridge between competing standards. Great enthusiasm for his idea, and he noted that his talk was really just a way of talking up the idea, and he proposed a meeting to bring the various parties together at his institution.

Lexomics: Integrating the Research and Teaching Spaces
Mark LeBlanc, Wheaton College

He said he was getting a little confused about what he was doing and wondered whether he was doing DH. His conclusions:

  • our collaborations are the best part of his job
  • culture of labs in the humanities (it has arrived, he feels)
  • work at the research-teaching loop (i.e.- it takes work)
  • undergraduates: “no-levels of bluff”

He’s been teaching a “computing for poets” class since 2008, and now alternates with a colleague who teaches an Anglo-Saxon lit and Tolkien class. What do they learn? Abstraction, problem decomposition, programming, and the design/execution of experiments. One lesson they learn: negative experiments are OK! He puts his syllabus and all of his assignments online.

Also building Lexos, a tool that helps students prepare texts for experiments.

The Digital Scholarship Training Programme at the British Library
Nora McGregor, British Library

Brand new program that started in November 2012. The BL has about 300 curators in their collection area. They have four digital curators who support staff in other areas. Unlike the other curators, the digital curators currently do not have their own collections (but will in future).

Their goal is to make all curators aware of and familiar with digital scholarship so that they can apply their expertise in the new realm. This also drives innovation and facilitates cross-disciplinary initiatives.

In practical terms, it’s a two-year program that consists of 15 day-long modules. They looked at the digital humanities field and gathered information on what scholars are doing and how and used that as the basis for their curriculum. They also collaborate with external institutions such as UCL, Oxford, etc.

She showed a list of the courses, which range from a basic “what is DS” intro to presentation skills to TEI, etc. Hopefully I can find the list, but from a quick scan, these are not higher-order skills (i.e.- they’re not trying to make curators into IT staff). Seems more about awareness and understanding than skill acquisition, although there’s some of that, too.

When asked during the Q&A if the curriculum and materials would be made openly available (i.e.- as OER), she seemed to indicate that that was not planned, but would only happen in direct collaboration with another institution.

D the H & H the C (Digitizing the Humanities and Humanizing Computing)
Kevin Kee, Brock University; Spencer Roberts, George Mason

Their talk was about an introduction to DH course for graduate students. Their concern is how to teach DH to not-so-technically literate grad students, not least since some reject technology and consider themselves to be technophobic.

They used DH methods to research what they should include. They used Bill Turkel’s method. They landed on Voyant and IBM’s Many Eyes and used them to pull together data on what kinds of topics, readings, and assignments are used in current practice. Created a network diagram of which authors are being read.

They want to emphasize a playful approach in their course. Beyond that, they need to make it responsive to the students’ needs and apply the tools to content familiar to students. Hearing the word “playful” in the context of graduate education is refreshing, not least since it implies experimentation and expanding one’s view.

They make four recommendations:

  • continue Spiro’s DH syllabi repository (it was a great source for them)
  • continue the conversation around DH education theory
  • more research about practical course design; collect resources for instructors
  • student participation in the design

Meta-Methodologies and the DH Methodological Commons: Potential Contribution of Management and Entrepreneurship to DH Skill Development
Lynne Siemens, University of Victoria

Beyond the core skills needed for successful DH projects (content knowledge and skills; technical knowledge and skills), what other skills are needed for success? Her answers:

  • project management
  • collaboration / teamwork
  • human resource management
  • financial / budget
  • user needs / marketing
  • product or resource development
  • alternative revenue generation
  • (missed one Prezi bubble here, but even with that it’s a detailed list)

Should the Digital Humanities Be Taking a Lead in Open Access and Online Teaching Materials?
Simon Mahony, UCL Centre for Digital Humanities, United Kingdom

It was interesting that he used the PLoS OA page to introduce the topic of Open Access. He also sketched the history of Open Educational Resources, which dates back to 1998. In a nutshell, OER is teaching materials being placed in the public domain. He noted that OER materials are still fairly low use, and speculated on reasons why that might be, among them low awareness and the fact that many scholars do not share their own materials via such channels. The goal is to make it “normal working practice.”

There is a question of scale, too. It’s one thing to share an entire course, another to break everything down into logical learning objects and share at that level. The latter is more useful, since it fosters integration into other courses and allows for a variety of recombinations. There are a number of sustainability issues with OER:

  • investment (lots of money pouring into creation, but not leading to sharing and sustainability)
  • academic culture
  • format suitable for re-purposing
  • size suitable for re-purposing

He spoke a lot about UNESCO and developing countries, noting that putting materials online and allowing free access may not work in all environments. There is also a long list of other issues that thwart sharing: discoverability, localization, assessment, ownership, adequate metadata, no classification standard, etc.

One thing he discussed rings true for me based on my observations on the failure of library coders to make their code available, namely, that many people are afraid to put less than perfect materials out there for people to see. He pointed out that by doing so, people will improve on them, correct them, make them better, so it’s a win for everyone.

Pointed to a site that won a United Nations award as a successful example of sharing videos. Building on that, transLectures takes the audio tracks and transcribes and translates the text.

Collaborative Technologies for Knowledge Socialization: The Case of elBulli
Antonio Jiménez-Mavillard, Western University, Canada

Funny, when I saw the title of this talk, I thought elBulli sounded familiar. Of course, it was the famous restaurant near Barcelona!

How did elBulli follow the four steps of knowledge creation? First through socialization: courses, conferences, fairs. Then externalization: catalogues, recipes, multimedia resources. Next, combination: techniques + products + recipes = new dishes. Last, internalization: learning to cook by practicing.

He developed Bullipedia. The restaurant closed, but started a foundation to support research on cuisine and cooking. What is clear is that “there is not a clear codification on cuisine.”

For his work, he’s using Nonaka and Takeuchi’s theoretical model of knowledge creation. Rather than employing F2F socialization, he now uses modern Web tools to bring masses of users together. This clearly points toward crowdsourcing as a tool, but that comes with a variety of challenges, of course, such as evaluating contributors.

Friday

Introducing Anvil Academic: Developing Publishing Models for the Digital Humanities
Fred Moody, Anvil Academic

Trying to address the gap between digital scholarship and what he called “analog metrics,” i.e.- the typical reward systems in the current career world. Alas, they’ve already seen people back out from publishing to go down the legacy publishing channels, such as a typical scholarly monograph.

In general, they are looking to publish material that cannot be fully expressed in print. As he noted, pretty much all of the work presented at this conference is not suited for that monograph form, so clearly there’s a body of work that needs an outlet. Such work needs a publisher, so that’s the niche they seek to fill. They have the ambition to become a known publisher that can provide authors with the status they need for tenure.

Asked the question: “what is publishing?”

  • peer review
  • editorial services
  • distribution
  • impact metrics
  • imprimatur
  • cataloging and preservation

Their criteria are fluid, since he noted that the work is varied in nature, but in general they want to see a scholarly contribution and a solid rationale for being a digital work (i.e.- the work needs to be presented digitally, not just a flourish). They do seek a balance between “traditional” scholarly content and the digital medium.

Interestingly, they don’t publish per se. They want to display works in their “native environments.” They apply their imprint, list it in their catalogue, but don’t require authors to republish in a new environment. This would be fairly impractical with this type of scholarship. In other words: no hosting. This does mean that the production burden is on the author(s).

eBook as Ecosystem of Digital Scholarship
Christopher Long, Penn State University

Spent the first few minutes discussing Socratic ideas. His basic research idea was to see if he could conduct serious academic work in public media (I think I really butchered this down, but that’s how it sits in my head). He calls this The Digital Dialogue, a podcast series where he invited various individuals to share their ideas and provided commentary. Tagline for the project is “cultivating the excellence of dialogue in the digital age.”

He then turned toward Platonic writing. As his slide put it, “Plato writes for readers” and “Platonic writing is a political art.” Written texts require engagement, as he explained. So his question became whether a book can perform this political act of dialogue, so he’s trying to publish a book that encourages public reading by allowing annotations which feed his blog and allow him to respond.

One thing I admire about his talk and project is that he clearly is investing a great deal of his own time and energy into kickstarting conversations and dialogue into life. Collaborative reading clearly takes work, so while one could see this as, well, something of a vanity project (since it places him at the crux of the dialogue), it’s hard to fault him for that since it’s his energy and time he’s investing. Someone asked if he’s worried about the time it will take, and his response was “I’m very hopeful I’ll have that problem.” Exactly!

Joint and Multi-authored Publication Patterns in the Digital Humanities
Julianne Nyhan, University College London

This grew out of her other work on the history of digital humanities. In general, she’s curious to see if the publication record bears out the notion that DH is inherently collaborative. Used Zotero to pull bibliographic data from two DH journals (CHUM and LLC), and did analysis on the results. Used another journal as a sort of control, where it was shown that single authorship holds steady over time.

Impossible to take detailed notes, as she was blowing through statistics at warp speed. Still, interesting to see that there is evidence that multiple authorship is rising over time to some degree, although that’s a simplistic statement given the numbers she presented.
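
As a back-of-the-envelope version of the kind of tally she described, here’s a sketch that assumes a CSV export with a year column and a semicolon-separated authors column; that format is my simplification, not her actual Zotero pipeline:

```python
import csv
from collections import defaultdict

def authorship_by_year(path):
    """Return {year: [total articles, multi-authored articles]} from a bibliographic CSV."""
    counts = defaultdict(lambda: [0, 0])
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            authors = [a for a in row["authors"].split(";") if a.strip()]
            counts[row["year"]][0] += 1
            if len(authors) > 1:
                counts[row["year"]][1] += 1
    return counts

# Hypothetical export of the two journals' article metadata.
for year, (total, multi) in sorted(authorship_by_year("dh_journal_articles.csv").items()):
    print(f"{year}: {multi}/{total} multi-authored")
```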

Identifying the Real-time Impact of the Digital Humanities Using Social Media Measures
Hamed Alhoori, Texas A&M University

Started with the question of how we are to cope with the onslaught of social media data. Question that arises: how can one use social media to get a sense of what research is catching the attention of the wider world? Typically, one measures impact in citations, or by using crass metrics such as views, downloads, bookmarks, etc. It’s hard to assign intentionality to such numbers, though.

Other tools come closer because they indicate readership: Zotero, Mendeley, etc. They used this type of data to test readership against Google’s citation counts. What they found in their work was that the humanities were missing, to some degree. They used something they called the research community article rating (RCAR), and pulled data from Mendeley. What they found is that citation is higher than readership when Mendeley data is pitted against Google citations. They then compared digital humanities against other fields, and found that in DH readership exceeds citations.

They went further and compared citation to altmetrics (tweets, FB mentions, etc.). They found that some articles have no citations, but altmetrics show impact.

Loved this talk, and will have to seek out the articles behind it to get more details. They seem to be scratching the surface of how to assess published scholarship beyond citations.

Visualizing Centuries: Data Visualization and the Comédie-Française Registers Project
Jason Lipshin, Massachusetts Institute of Technology

CFRP provides digitized access to rare and important documents on French theatre. They stem from a time of great upheaval in France and could shed light on important questions.

Simple notion: visualization as a form of machine reading; one can toggle between the macro and the micro.

The items they’re focusing on are daily box office receipts that show title, tickets sold, etc. Quite a large collection. Generated several kinds of visualizations, including a theatre

Created a browser-based tool to support the visualizations.

ChartEx: A Project to Extract Information from the Content of Medieval Charters and Create a Virtual Workbench for Historians to Work with this Information
Helen Petrie, Sarah Rees Jones, University of York, UK
ChartEx site

Charters are legal documents noting the transfer of ownership of a piece of property. More interesting than the legal matters are the relationships that these documents reveal.

The issue is that when one moves from reading one charter to reading thousands: how does one sort and arrange the charters and array them spatially? The descriptions are often in natural language and there’s little regularity in description.

Challenge: how to automate, or semi-automate, such a boring and tedious process. They want computers to interpret these vague place descriptions (natural language processing) and map them to actual places. After this NLP process it becomes a data mining process, where one challenge is to identify people between charters (again, there are variant forms of names, etc.). This requires probabilistic reasoning methods. Beyond that, site matching is also a DM challenge.
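
A minimal sketch of the variant-name problem using plain string similarity from the standard library; their system uses probabilistic reasoning over much richer evidence, so treat this as illustration only (the name forms are hypothetical):

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Crude spelling-based similarity between two name forms (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical name forms drawn from two different charters.
pairs = [
    ("Willelmus de Eboraco", "Willelmus de Ebor"),
    ("Willelmus de Eboraco", "William of York"),
    ("Johannes filius Ricardi", "Iohannes fil. Ric."),
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: {name_similarity(a, b):.2f}")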

Beyond just allowing historians to see the data, they aspire to create a system where they can actually work with the data: reasoning, annotations, sharing. In other words, do some things they can do in an analogue setting, but go beyond that and offer more powerful tools.

Her approach is to find out what people want to do with such documents, rather than just focusing on the content. They used contextual inquiry, where one observes a process and asks the subject to walk and talk the researcher through their thought and work processes. It was difficult for historians to reflect on what they’re doing, but they are “getting there” in her words. This kind of HCI approach (she joked that H is for historian!) seems wise given that one wants to create tools that have potential broad appeal because they enable the right kind of interactions. They create PowerPoint prototypes, etc., which seems very familiar to anyone who has created user interfaces.

Dyadic Pulsations as a Signature of Sustainability in Correspondence Networks
Frédéric Kaplan, Ecole Polytechnique Federale de Lausanne, Switzerland

Patterns are a big deal in DH research these days, and everyone is looking for signatures in correspondence networks; it turns out that the reply time between two exchanges is a significant variable. These delays point toward prioritization strategies, among other possibilities. Their hypothesis is that response time indicates the general ‘health’ of a correspondence network.

A sudden spike in response time often indicates the end of a discussion. They can observe this with Usenet discussion groups (I think that was their data source). Question: can this be predicted before it occurs? One of the authors came up with the notion of dyadic pulsation.
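
Not their dyadic pulsation measure itself (I’d need the paper for that), but here’s a sketch of the simpler response-time signal he started from: flag when the gap between consecutive messages jumps well above the recent norm. The thread timestamps are hypothetical.

```python
from statistics import median

def flag_spikes(timestamps, factor=5.0, window=5):
    """Yield (message index, gap) where the reply gap jumps well above the recent median."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    for i in range(window, len(gaps)):
        recent = median(gaps[i - window:i])
        if recent > 0 and gaps[i] > factor * recent:
            yield i + 1, gaps[i]

# Hypothetical thread timestamps, in hours since the first post.
thread = [0, 1, 2.5, 3, 4.2, 5, 6.1, 7, 40]
for idx, gap in flag_spikes(thread):
    print(f"message {idx} arrived after a {gap:.1f}h gap; possible end of discussion")
```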

In a series of slides, he walked through the notion of pulsation. Difficult to capture in notes. Visually, it made a whole ton of sense, so one needs to see the graphs.

He pointed out that for dyadic pulsations to be useful, they need to be free from spam, which is common in discussion groups. Made a joke that we don’t respond to spam, but if we do, it’s de facto not spam. So a dyadic pulsation is inherently spam free because it requires a response.

Went into an interesting discussion about a relative degree of spamness. A lack of dyadic pulsations can indicate the actual contributors are being treated as spam, i.e.- their messages to groups are routinely ignored and become de facto spam even if they are real people, not scripts or bots.

Sidenote: he had effective and attractive slides. Visually appealing, huge text, supportive of his words without being repetitive. He really understood the nature of the short paper format and nailed it.

Using the Social Web to Explore the Online Discourse and Memory of the Civil War
Simon Appleford, Clemson University, US

Pointed out that given the torrent of conversations in social media, it can be challenging to winnow it down to the useful information one wants to study. Most people are concerned about how to get more eyeballs, but academics are more interested in what he described as the long tail of social media, i.e.- the narrower-band conversations.

They are engaged in developing tools that help with gathering this kind of data, but searching, say, tweets using keywords isn’t a useful exercise because any keyword generates false drops. Their tool in development will allow researchers to take data they’ve pulled from a social network (mainly, Twitter), upload it into the system which will convert and index the data, and enable analysis. Tool is known as the Social Web Analytics Tool and will be open source. Beta will be available in three months or so.

Ended by talking about curation and scale issues. It moves so quickly that archiving and sharing become complicated matters. Sure, the Library of Congress is archiving Twitter, but what does that mean for research, and don’t we need more (his question).

Expanding and Connecting the Annotation Tool ELAN
Han Sloetjes, The Language Archive — Max Planck Institute for Psycholinguistics

Language Archive is part of the Max Planck Institute for Psycholinguistics (only MPG site in The Netherlands). ELAN = Elan Linguistic ANnotator. It is a tool for manual annotation of multimedia. Open source, written in Java, available for Win, Mac, and Linux. Main user populations are language documentation, sign language research, gesture research, multimodality research, etc.

Essentially a toolkit that displays a video and offers a workspace that permits annotation. Various modes: annotation, synchronization, segmentation, and transcription. Now engaged in extending the tool by allowing other Web services to be called, such as WebLicht and WebMAUS.

“Shall These Bits Live?” Towards a Digital Forensics Research Agenda for Digital Humanities with the BitCurator Project
Matthew Kirschenbaum, University of Maryland; Alex Chassanoff, UNC Chapel Hill

Emphasized that while it’s a terrible name, to some degree we’re working with the digitized humanities. These days it’s also a given that archives and libraries are now ingesting born digital materials.

Digital forensics emerges from, as he put it, darker purposes such as policing and surveillance, but it’s not a reach to assert that there is a great deal of benefit for cultural heritage institutions in these methods. Disk images are critical, i.e.- a bitstream copy of the original media. He referred to these as the gold standard for born-digital archives.

Made a point to mention screen essentialism, which is being focused only on what’s on the screen. For archives, it’s far more compelling to have the original bitstream since it reveals potentially interesting information that does not appear on the screen.

BitCurator is an open-source digital forensics software system. Jointly developed by UNC-Chapel Hill and MITH at the University of Maryland. Mellon funded. The objective is simple: ensure the integrity and authenticity of born-digital materials (as well as their provenance and trustworthiness). BitCurator generates an XML-formatted PREMIS file containing all of the preservation actions taken on a file.

She ran through a list of reasons why digital archivists should collaborate with digital humanists. Long list, too detailed to record in notes, but in essence, there are common interests, not least in terms of making clear what needs to be preserved. This also points the way toward a number of future collaborations between digital forensics and digital humanities.

Surrogacy and Image Error: Transformations in the Value of Digitized Books
Paul Conway, University of Michigan

His research concerns the incidence of error in HathiTrust. Part of his talk concerns issues that he’s already published on, but the latter part covers issues around surrogacy, which is sort of the new ground here.

97% of HathiTrust content comes from Google Books, 2% from the Internet Archive, and about 1% from other HT members. They are well beyond 10 million volumes, of which about a third are in the public domain. Lawsuits have gummed up having more in the PD.

He made the simple but clear point that the methods used by Google are far removed from what we know in cultural institutions. As he noted, this work has triggered something of a moral panic, with a lot of commentary in the blogosphere and from other sources.

They created an error model, where they classify various errors and identify possible causes. They also established an error severity scale, from error undetectable (0) to error renders object indecipherable (5).

He’s really not being very kind about Google’s methods. Not unfair, really, but very up front about why they’re doing it (for OCR to enhance search) and how that makes the product less than ideal. Illustrations, for example, matter little to Google and so show significant errors, according to his research. Then there are myriad errors we all know: missing pages, fingers on the page, poor cropping, warped text, etc. He didn’t spare the Internet Archive, either. He noted that the colouring they add to pages in post-processing (to make books look old) creates bad problems when it goes awry. Their use of glass platen scanners means no fingers, but introduces inevitable blurring issues. This was a fun tour of the horrors of mass digitization. Might help explain why Google Books just doesn’t fire the public imagination as it used to.

He ran through some interesting phenomena that he described as “resistance to homogenization,” including resistance from what he called industrialized information workers, i.e.- the Google employees doing the scanning who lack benefits. They tend, for example, to put their hands over foldouts or end papers, objects where OCR doesn’t matter and that Google deems unimportant.

Surrogacy becomes an interesting topic. As Terras puts it, how can we trust the texts that are produced so carelessly (*badly paraphrased*)?

A massive team (45 people) has been doing this work, funded by an IMLS grant.


