The rapid growth of scanned-work digital libraries presents a new opportunity for learning more about our collections. With digital access to the full text of the books in a collection, content-based text mining methods can reveal the relationships between works, helping to correct inaccurate metadata, suggest classification information, recommend similar works, and label the nature of links between works.
This talk will introduce the Similarities and Duplication in Digital Libraries project (SaDDL), a project to identify same-work relationships among the 17 million works in the HathiTrust Digital Library. SaDDL is identifying exact duplicates as well as traditionally difficult-to-identify relationships such as derivatives, different editions, abridgments, and whole-part relationships. We present the challenges of the problem, our project's approach to meeting them, and a new dataset that lets cataloguers and scholars apply our results.