Information Extraction from Natural Document Formats with David Rosenberg - TWiML Talk #126
Research | #llm | Blog | Analyzed: Dec 29, 2025 08:28
Published: Apr 9, 2018 17:23
Practical AI
Analysis
This article summarizes a podcast episode featuring David Rosenberg, a data scientist at Bloomberg, about the company's work on extracting data from unstructured financial documents such as PDFs. The discussion centers on a deep learning pipeline built to extract data efficiently from tables and charts. Topics covered include the construction of the pipeline, the sourcing of training data, the use of LaTeX as an intermediate representation, and optimization for pixel-perfect accuracy. The episode offers a practical look at applying deep learning to information extraction in the financial industry.
Key Takeaways
- Bloomberg uses a deep learning pipeline for information extraction from financial documents.
- The pipeline extracts data from tables and charts in PDF and other unstructured formats.
- The project involves training data sourcing, LaTeX as an intermediate representation, and pixel-perfect accuracy optimization.
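To make the idea of LaTeX as an intermediate representation more concrete, here is a minimal sketch of the last step such a pipeline might need: turning LaTeX table markup back into structured rows. The summary does not describe Bloomberg's actual markup or code, so the `parse_tabular` helper, the `SAMPLE` table, and its figures are all hypothetical and only illustrate the general approach.

```python
import re


def parse_tabular(latex: str) -> list[list[str]]:
    """Parse a simple LaTeX tabular environment into rows of cell strings.

    Hypothetical helper: handles only plain cells, with no nested
    environments, \\multicolumn, or escaped ampersands.
    """
    # Grab the text between \begin{tabular}{...} and \end{tabular}.
    match = re.search(
        r"\\begin\{tabular\}\{[^}]*\}(.*?)\\end\{tabular\}", latex, re.DOTALL
    )
    if match is None:
        raise ValueError("no tabular environment found")
    body = match.group(1)

    # Rule commands carry layout, not data, so drop them.
    body = re.sub(r"\\(hline|toprule|midrule|bottomrule)", "", body)

    rows = []
    # Rows end with a double backslash; cells are separated by '&'.
    for raw_row in body.split("\\\\"):
        if raw_row.strip():
            rows.append([cell.strip() for cell in raw_row.split("&")])
    return rows


# Made-up sample table, standing in for the output of a model that
# transcribes a table image into LaTeX.
SAMPLE = r"""
\begin{tabular}{lrr}
\hline
Quarter & Revenue & Net income \\
\hline
Q1 & 120.5 & 14.2 \\
Q2 & 131.0 & 16.8 \\
\hline
\end{tabular}
"""

rows = parse_tabular(SAMPLE)
for row in rows:
    print(row)
```

One reason a markup language is attractive as an intermediate form is that it encodes both the cell contents and the table layout in plain text, so a downstream step like the one above can recover the data without touching pixels again.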
Reference / Citation
"Bloomberg is dealing with tons of financial and company data in pdfs and other unstructured document formats on a daily basis."