Teaching Computers to Read: Dataset Curation Impact on Model Performance

Date & Time

Jan. 15, 2026, 8:30 p.m. - Jan. 15, 2026, 10 p.m.

Cost

$0

Location

Online


Description

Learn how curated datasets drive real-world NLP gains, with practical guidelines, templates, and examples for data scientists building reliable AI.

 

Details

Workshop Summary: Successful AI solutions aren’t about chasing the newest model; they’re about solving the right problems in the right way. The book “Teaching Computers to Read” (out November 5 from CRC Press) focuses on what technical teams need to design, develop, deploy, and maintain useful NLP and AI solutions. Drawing on real-world experience and examples, the book offers actionable best practices for delivering adaptable, reliable AI systems that address business challenges with lasting, tangible value. In this tutorial, we will walk through one part of the Code Companion for the book. We will review the corpus distribution and variation and our annotated data distribution, then explore how our curated datasets affect the performance of different technical approaches, using information extraction as an example. The book covers the tutorial's concepts in more detail, and the Code Companion includes additional exercises for those interested in going beyond the tutorial session.
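
As a rough illustration of what reviewing an annotated data distribution can involve, here is a minimal Python sketch. It is not taken from the Code Companion: the label names, the sample lists, and the label_distribution helper are all hypothetical, chosen only to show how label shares in two splits of an information extraction dataset might be compared.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the total as a dict of fractions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical entity labels for an information extraction task
# (illustrative values only, not data from the book or its repo).
train_labels = ["DATE", "DATE", "PARTY", "AMOUNT", "DATE", "PARTY"]
test_labels = ["DATE", "AMOUNT", "AMOUNT", "PARTY"]

train_dist = label_distribution(train_labels)
test_dist = label_distribution(test_labels)

# Print the share of each label in both splits and the difference between them.
for label in sorted(set(train_dist) | set(test_dist)):
    train_share = train_dist.get(label, 0.0)
    test_share = test_dist.get(label, 0.0)
    gap = test_share - train_share
    print(f"{label}: train={train_share:.2f} test={test_share:.2f} gap={gap:+.2f}")
```

A large gap for a label suggests the evaluation split may not reflect what a model saw during training, which is one way dataset curation can shape measured performance.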

Prerequisites: To follow along, you will need the "Teaching Computers to Read" Code Companion repo. Follow the 3 steps in the "Setup" section to clone the repo, create a virtual environment, install requirements, and download the relevant files (linked in the README to a Hugging Face dataset).

Bio: Rachel Wagner-Kaiser has 15 years of experience in data and AI, entering the data science field after completing her PhD in astronomy. She specializes in building NLP solutions for real-world problems constrained by limited or messy data. Rachel leads technical teams to design, build, deploy, and maintain NLP solutions, and her expertise has helped companies organize and decode their unstructured data to solve a variety of business problems and drive value through automation. Rachel is also the author of the book "Teaching Computers to Read".
