Extract PDF text in your browser with LiteParse for the web

Simon Willison's Weblog


59 minutes of vibe coding, zero lines of code personally reviewed: Simon Willison ported the LiteParse PDF parsing tool into the browser

  • LiteParse tackles multi-column PDF layouts with heuristics, relying on no AI models at all, which makes it fast and able to run offline
  • Claude Code completed the port in 59 minutes, and Simon himself never looked at a single line of the HTML or TypeScript
  • The tool runs entirely in the browser and sends no data anywhere, and free deployment on GitHub Pages means zero cost and minimal security risk

59 minutes, zero lines of code personally reviewed: Simon Willison used Claude Code (Anthropic's AI coding agent) to move LlamaIndex's PDF parsing tool LiteParse from a Node.js CLI into a pure browser environment without looking at a single line of the HTML or TypeScript, then deployed the result for free on GitHub Pages. He describes the process as his purest vibe coding yet, while still arguing that the engineering judgment behind it was sound and the result worth recommending.

How LiteParse's spatial text parsing works

LiteParse is an open-source PDF text extraction tool from LlamaIndex (the company behind the open-source AI application framework), shipped as a Node.js CLI (command-line interface). Its defining trait is that it relies on no AI models: it uses traditional PDF parsing techniques, falling back to Tesseract OCR (the open-source optical character recognition engine) only for image-based PDFs such as scans.

The core problem it solves is called spatial text parsing. The PDF format does not guarantee that the stored order of text matches the visual reading order; with two-column layouts, tables, or floating captions, reading the raw text stream directly can produce gibberish. LiteParse uses heuristics (rule-based judgments rather than machine learning) to detect multi-column layouts and reorder the text into a linear flow that matches how a human would read the page.
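The multi-column reordering idea can be sketched in a few lines. This is a hypothetical illustration of the general heuristic only, not LiteParse's actual algorithm: the `linearize` name, the item shape, and the gap threshold are all invented for the example.

```javascript
// Illustrative sketch of a column-detection heuristic (not LiteParse's
// actual implementation): group text items into columns by x-position,
// then read each column top-to-bottom, columns left-to-right.
function linearize(items, gapThreshold = 50) {
  const sorted = [...items].sort((a, b) => a.x - b.x);
  const columns = [];
  for (const item of sorted) {
    // Attach the item to an existing column if its x-start is close enough,
    // otherwise start a new column.
    const col = columns.find(c => Math.abs(c.x - item.x) < gapThreshold);
    if (col) col.items.push(item);
    else columns.push({ x: item.x, items: [item] });
  }
  columns.sort((a, b) => a.x - b.x);
  return columns
    .flatMap(c => c.items.sort((a, b) => a.y - b.y).map(i => i.text))
    .join(" ");
}

// A two-column page: the raw PDF text stream might interleave the columns.
const items = [
  { x: 40,  y: 10, text: "Left line 1" },
  { x: 300, y: 10, text: "Right line 1" },
  { x: 40,  y: 30, text: "Left line 2" },
  { x: 300, y: 30, text: "Right line 2" },
];
console.log(linearize(items));
// → "Left line 1 Left line 2 Right line 1 Right line 2"
```

Real layouts need far more care (varying column widths, tables, headers spanning columns), which is what makes LiteParse's heuristics noteworthy.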

Another notable feature in the documentation is Visual Citations with Bounding Boxes: in a RAG setting (retrieval-augmented generation, i.e. having an AI answer questions grounded in documents), answers can be accompanied by a cropped screenshot of the original PDF page with the cited region highlighted, which increases the credibility of the answer. Under the hood, LiteParse depends on PDF.js and Tesseract.js, two JavaScript libraries that can already run in the browser, and that fact is the technical foundation of the porting work that followed.
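The bounding-box side of visual citations can be illustrated with a small helper: given the text items backing a citation, compute the union rectangle a UI could crop or highlight on the rendered page. This is a sketch of the general idea under an assumed item shape; `citationBox` is an invented name, not part of the LiteParse API.

```javascript
// Hypothetical sketch: union the bounding boxes of the text items that
// back a citation, producing one rectangle to crop/highlight on the page.
function citationBox(items) {
  return items.reduce(
    (box, it) => ({
      x0: Math.min(box.x0, it.x),
      y0: Math.min(box.y0, it.y),
      x1: Math.max(box.x1, it.x + it.width),
      y1: Math.max(box.y1, it.y + it.height),
    }),
    { x0: Infinity, y0: Infinity, x1: -Infinity, y1: -Infinity }
  );
}

// Two cited lines of text, in page coordinates.
const cited = [
  { x: 72, y: 100, width: 200, height: 12 },
  { x: 72, y: 114, width: 180, height: 12 },
];
console.log(citationBox(cited));
// → { x0: 72, y0: 100, x1: 272, y1: 126 }
```

A rectangle like this is exactly what a RAG UI needs in order to attach a cropped, highlighted page image to an answer.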

From iPhone to Claude Code: a browser port in 59 minutes

Simon first tried LiteParse from the Claude web interface on his iPhone, uploading a PDF and asking Claude to clone the GitHub repository and run the tool against it. He then asked the key question: could this library run in a browser? Claude's answer convinced him it was technically feasible, and that the only reason LiteParse had no browser version was that nobody had built one yet.

Simon then opened his laptop and switched to Claude Code: he forked the repository, created a new branch, saved Claude's research as notes.md, and told it to turn this into a web app, writing a detailed implementation plan to plan.md first. He likes having Claude produce a plan document for projects like this, because it gives him something to discuss and revise. For example, Claude originally planned to defer the PDF screenshot feature to v2; Simon simply instructed it to ship that in v1.

Once the plan looked good he said "build it", went off to do other things (including a Duolingo session), and occasionally checked in with additional requests. From that instruction to completion, Claude Code took 59 minutes in total. To verify that Claude had not cheated, for example by marking key features as TODO or faking them, Simon asked OpenAI Codex with GPT-5.5 (he had preview access) to describe the technical differences between the Node.js CLI version and the browser version; only after getting a detailed and accurate comparison was he satisfied.

TDD, Playwright, and a Safari bug: details from the development process

Among his instructions to Claude Code, Simon required red/green TDD (test-driven development: write the test first, watch it fail, then make it pass) using Playwright (Microsoft's end-to-end browser testing framework). He also asked for "small commits along the way", which he believes helps the AI focus on one problem at a time and makes the work easier to review afterwards.

Development hit a classic cross-browser issue: everything worked in Chrome and Firefox, but Safari failed with "Parse failed: undefined is not a function". Simon pasted the error message into Claude Code, pointed out it was Safari-specific, and Claude located and fixed the problem quickly. On the UI side, he pasted a screenshot of a long filename breaking the layout in Firefox and let Claude fix the bug from the image, an approach he says "works surprisingly well".
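The full error mentions ReadableStream, which hints at the class of bug involved. As a hedged illustration only (the actual fix Claude made may have been different): Safari long lacked async iteration over ReadableStream, so `for await` over a stream throws "undefined is not a function" there while working in Chrome and Firefox. A portable pattern feature-detects it and falls back to a plain reader loop.

```javascript
// Sketch of a cross-browser stream-reading pattern. Assumption: the
// Safari failure was of this class; the project's real fix may differ.
async function readAll(stream) {
  const chunks = [];
  if (typeof stream[Symbol.asyncIterator] === "function") {
    // Chrome / Firefox / Node 18+: streams are async-iterable.
    for await (const chunk of stream) chunks.push(chunk);
  } else {
    // Safari fallback: explicit reader loop, supported everywhere.
    const reader = stream.getReader();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      chunks.push(value);
    }
  }
  return chunks;
}

// Demo with a web ReadableStream (global in modern browsers and Node 18+).
const demo = new ReadableStream({
  start(c) { c.enqueue("a"); c.enqueue("b"); c.close(); },
});
readAll(demo).then(chunks => console.log(chunks.join(""))); // → "ab"
```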

For a development server, he asked a separate Claude Code session how to preview the app live and was told to run npx vite, which starts a local dev server with live reloading. Deployment was handled by a third, independent Claude Code session, which set up GitHub Actions: every push runs the Playwright tests first, and if they pass, Vite (the front-end build tool) bundles the app and deploys it to GitHub Pages automatically, all at zero cost.
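A pipeline of that shape could look roughly like the following. This is a hypothetical sketch based on the description above, not the repository's actual workflow file; the job names, branch, paths, and action versions are all assumptions.

```yaml
# Hypothetical GitHub Actions workflow: test on push, deploy to Pages on success.
name: test-and-deploy
on:
  push:
    branches: [web]
permissions:
  contents: read
  pages: write
  id-token: write
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test        # gate: deploy only if tests pass
  deploy:
    needs: test
    runs-on: ubuntu-latest
    environment:
      name: github-pages
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci && npx vite build   # bundle the app
      - uses: actions/upload-pages-artifact@v3
        with:
          path: dist
      - uses: actions/deploy-pages@v4
```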

The boundaries of vibe coding: how Simon sees this case

Simon Willison is strict about the definition of vibe coding: it does not mean any use of AI to help write code, but specifically using AI without reviewing or caring about the code it writes at all. By his own standard, the browser version of LiteParse is the purest vibe coding he has done; he only checked whether Claude had used JavaScript or TypeScript while writing up the blog post.

He considers this a safe case of vibe coding for three reasons. First, the blast radius of a static, browser-side tool is essentially zero: the worst a bug can do is fail to parse a particular PDF, with no servers or user data at stake. Second, all PDF processing happens locally in the browser; he checked the network panel and confirmed no extra external requests are made while a PDF is being parsed, so a security audit is unnecessary. Third, the project still required engineering judgment: recognizing that LiteParse could be ported, choosing the right combination of libraries, and deciding to use TDD for quality control were all deliberate technical decisions.

He has not opened a PR against the original LiteParse repository, but he has opened an issue, and the original team is welcome to use this browser version as a starting point for an official feature.

Stack a static page, browser-side execution, and GitHub Pages together, and the blast radius of vibe coding shrinks to nearly harmless; that structural precondition is what makes this a project he can comfortably recommend.

Original post

LlamaIndex have a most excellent open source project called LiteParse, which provides a Node.js CLI tool for extracting text from PDFs. I got a version of LiteParse working entirely in the browser, using most of the same libraries that LiteParse uses to run in Node.js.

Spatial text parsing

Refreshingly, LiteParse doesn't use AI models to do what it does: it's good old-fashioned PDF parsing, falling back to Tesseract OCR (or other pluggable OCR engines) for PDFs that contain images of text rather than the text itself.

The hard problem that LiteParse solves is extracting text in a sensible order despite the infuriating vagaries of PDF layouts. They describe this as "spatial text parsing" - they use some very clever heuristics to detect things like multi-column layouts and group and return the text in a sensible linear flow.

The LiteParse documentation describes a pattern for implementing Visual Citations with Bounding Boxes. I really like this idea: being able to answer questions from a PDF and accompany those answers with cropped, highlighted images feels like a great way of increasing the credibility of answers from RAG-style Q&A.

LiteParse is provided as a pure CLI tool, designed to be used by agents. You run it like this:

  npm i -g @llamaindex/liteparse
  lit parse document.pdf

I explored its capabilities with Claude and quickly determined that there was no real reason it had to stay a CLI app: it's built on top of PDF.js and Tesseract.js, two libraries I've used for something similar in a browser in the past. The only reason LiteParse didn't have a pure browser-based version is that nobody had built one yet...

Introducing LiteParse for the web

Visit https://simonw.github.io/liteparse/ to try out LiteParse against any PDF file, running entirely in your browser. The tool can work with or without running OCR, and can optionally display images for every page in the PDF further down the page.

Building it with Claude Code and Opus 4.7

The process of building this started in the regular Claude app on my iPhone. I wanted to try out LiteParse myself, so I started by uploading a random PDF I happened to have on my phone along with this prompt:

  Clone https://github.com/run-llama/liteparse and try it against this file

Regular Claude chat can clone directly from GitHub these days, and while by default it can't access most of the internet from its container it can also install packages from PyPI and npm. I often use this to try out new pieces of open source software on my phone - it's a quick way to exercise something without having to sit down with my laptop. You can follow my full conversation in this shared Claude transcript.

I asked a few follow-up questions about how it worked, and then asked:

  Does this library run in a browser? Could it?

This gave me a thorough enough answer that I was convinced it was worth trying getting that to work for real.

I opened up my laptop and switched to Claude Code. I forked the original repo on GitHub, cloned a local copy, started a new web branch and pasted that last reply from Claude into a new file called notes.md. Then I told Claude Code:

  Get this working as a web app. index.html, when loaded, should render an app that lets users open a PDF in their browser and select OCR or non-OCR mode and have this run. Read notes.md for initial research on this problem, then write out plan.md with your detailed implementation plan

I always like to start with a plan for this kind of project. Sometimes I'll use Claude's "planning mode", but in this case I knew I'd want the plan as an artifact in the repository so I told it to write plan.md directly. This also means I can iterate on the plan with Claude. I noticed that Claude had decided to punt on generating screenshots of images in the PDF, and suggested we defer a "canvas-encode swap" to v2. I fixed that by prompting:

  Update the plan to say we WILL do the canvas-encode swap so the screenshots thing works

After a few short follow-up prompts, here's the plan.md I thought was strong enough to implement. I prompted: build it. And then mostly left Claude Code to its own devices, tinkered with some other projects, caught up on Duolingo and occasionally checked in to see how it was doing.

I added a few prompts to the queue as I was working. Those don't yet show up in my exported transcript, but it turns out running the following in the relevant ~/.claude/projects/ folder extracts them:

  rg queue-operation --no-filename | grep enqueue | jq -r '.content'

Here are the key follow-up prompts with some notes:

  • When you implement this use playwright and red/green TDD, plan that too - I've written more about red/green TDD here.
  • let's use PDF.js's own renderer (it was messing around with pdfium)
  • The final UI should include both the text and the pretty-printed JSON output, both of those in textareas and both with copy-to-clipboard buttons - it should also be mobile friendly - I had a new idea for how the UI should work
  • small commits along the way - see below
  • Make sure the index.html page includes a link back to https://github.com/run-llama/liteparse near the top of the page - it's important to credit your dependencies in a project like this!
  • View on GitHub → is bad copy because that's not the repo with this web app in, it's the web app for the underlying LiteParse library
  • Run OCR should be unchecked by default
  • When I try to parse a PDF in my browser I see 'Parse failed: undefined is not a function (near '...value of readableStream...') - it was testing with Playwright in Chrome, turned out there was a bug in Safari
  • ... oh that is in safari but it works in chrome
  • When "Copy" is clicked the text should change to "Copied!" for 1.5s
  • [Image #1] Style the file input so that long filenames don't break things on Firefox like this - in fact add one of those drag-drop zone UIs which you can also click to select a file - dropping screenshots in of small UI glitches works surprisingly well
  • Tweak the drop zone such that the text is vertically centered, right now it is a bit closer to the top
  • it breaks in Safari on macOS, works in both Chrome and Firefox. On Safari I see "Parse failed: undefined is not a function (near '...value of readableStream...')" after I click the Parse button, when OCR is not checked - it still wasn't working in Safari...
  • works in safari now - but it fixed it pretty quickly once I pointed that out and it got Playwright working with that browser

I've started habitually asking for "small commits along the way" because it makes for code that's easier to understand or review later on, and I have an unproven hunch that it helps the agent work more effectively too - it's yet another encouragement towards planning and taking on one problem at a time.

While it was working I decided it would be nice to be able to interact with an in-progress version. I asked a separate Claude Code session against the same directory for tips on how to run it, and it told me to use npx vite. Running that started a development server with live-reloading, which meant I could instantly see the effect of each change it made on disk - and prompt with further requests for tweaks and fixes.

Towards the end I decided it was going to be good enough to publish. I started a fresh Claude Code instance and told it:

  Look at the web/ folder - set up GitHub actions for this repo such that any push runs the tests, and if the tests pass it then does a GitHub Pages deploy of the built vite app such that the web/index.html page is the index.html page for the thing that is deployed and it works on GitHub Pages

After a bit more iteration here's the GitHub Actions workflow that builds the app using Vite and deploys the result to https://simonw.github.io/liteparse/. I love GitHub Pages for this kind of thing because it can be quickly configured (by Claude, in this case) to turn any repository into a deployed web-app, at zero cost and with whatever build step is necessary. It even works against private repos, if you don't mind your only security being a secret URL.

With this kind of project there's always a major risk that the model might "cheat" - mark key features as "TODO" and fake them, or take shortcuts that ignore the initial requirements. The responsible way to prevent this is to review all of the code... but this wasn't intended as that kind of project, so instead I fired up OpenAI Codex with GPT-5.5 (I had preview access) and told it:

  Describe the difference between how the node.js CLI tool runs and how the web/ version runs

The answer I got back was enough to give me confidence that Claude hadn't taken any project-threatening shortcuts.

... and that was about it. Total time in Claude Code for that "build it" step was 59 minutes. I used my claude-code-transcripts tool to export a readable version of the full transcript which you can view here, albeit without those additional queued prompts (here's my issue to fix that).

Is this even vibe coding any more?

I'm a pedantic stickler when it comes to the original definition of vibe coding - vibe coding does not mean any time you use AI to help you write code, it's when you use AI without reviewing or caring about the code that's written at all.

By my own definition, this LiteParse for the web project is about as pure vibe coding as you can get! I have not looked at a single line of the HTML and TypeScript written for this project - in fact while writing this sentence I had to go and check if it had used JavaScript or TypeScript.

Yet somehow this one doesn't feel as vibe coded to me as many of my other vibe coded projects:

  • As a static in-browser web application hosted on GitHub Pages the blast radius for any bugs is almost non-existent: it either works for your PDF or doesn't.
  • No private data is transferred anywhere - all processing happens in your browser - so a security audit is unnecessary. I've glanced once at the network panel while it's running and no additional requests are made when a PDF is being parsed.
  • There was still a whole lot of engineering experience and knowledge required to use the models in this way. Identifying that porting LiteParse to run directly in a browser was critical to the rest of the project.

Most importantly, I'm happy to attach my reputation to this project and recommend that other people try it out. Unlike most of my vibe coded tools I'm not convinced that spending significant additional engineering time on this would have resulted in a meaningfully better initial release. It's fine as it is!

I haven't opened a PR against the origin repository because I've not discussed it with the LiteParse team. I've opened an issue, and if they want my vibe coded implementation as a starting point for something more official they're welcome to take it.

Tags: javascript, ocr, pdf, projects, ai, generative-ai, llms, vibe-coding, coding-agents, claude-code, agentic-engineering