Show HN: Defeating AI by making knowledge accessible to Humans
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
To counter AI by improving human access to knowledge, this project builds offline (PWA) dictionary web apps and tools for constructing dictionaries for minority languages. It currently supports a range of languages including Khmer, English, and Lao; as the user types in the search bar, it returns combined results from the dictionaries enabled on the page. It also supports text selection and text-to-speech, including the novelty of having English read aloud in a Khmer accent.
Full text
See it live.

A project to make offline (PWA) dictionary webapps, and tools for building them, for obscure languages. Current languages:

- Khmer (Cambodian)
- English (This is mainly so that we can load English resources to combine with other dictionaries.)
- Lao (Laos)
- Chitonga (Tonga)
- Lozi (Silozi)
- German
- Spanish
- Indonesian
- Levantine (Lebanese Arabic)

For the fullest example, try PeakSlab.org/khmer and enter something in the search bar. It searches as you type, pulling in definitions from all the dictionaries you have enabled on the page. To change which dictionaries are loaded, simply clear the search bar; the dictionary list will appear and you can enable and disable them individually. When you're searching something, e.g. Gen 7, and want to look up a word without losing your spot, select the word and push the popup button. If your system supports text-to-speech you'll also see options for that. The funnest thing you can do is select English text and have the Khmer TTS read it in a Khmer accent. The Khmer dictionary page is really the prototype and the reason for every other dictionary page.

The files are cached to your device automatically, but since they're compressed it's not too bad. The 20 or so dictionaries included in the Khmer page are about 53mb all together. (26mb of that is the Kora Praise one, because it contains 1500 sheet music scans. I did not compress that file because the savings are negligible and the file is so big anyway.)

People go through great effort to gather proper data for AI to learn from, so my question is: why don't we make knowledge and data accessible to humans instead, so that we can learn? I'm sick of people using AI as a dictionary; it's slow, internet-dependent, prone to hallucinations, and untrustworthy. The only advantage AI has over us is more data and better ways to access it, so let's remedy that.

For these tests I ran my laptop connected to my phone's hotspot to serve the page. Files are cached to my Moto G Power 2024 running Brave, and the page is refreshed at > 5 second intervals to measure load time from cache.

| Format | Load time | Speedup | File size |
|---|---|---|---|
| SQLite3 | 789ms | 1.0x | 84mb |
| .peak | 481ms | 1.64x | 49mb |
| .peak split | 391ms | 2.02x | 58mb |
| .peak split (dual worker) | 380ms | 2.08x | 58mb |
| .peak.zst | 712ms | 1.11x | 9.3mb |
| .peak.zst split | 570ms | 1.38x | 11mb |
| .peak.zst split (dual worker) | 479ms | 1.65x | 11mb |

Uncompressed sizes:

| Format | File size | Percentage |
|---|---|---|
| .tsv (src file) | 52mb | 100% |
| SQLite3 | 84mb | 162% |
| .peak | 49mb | 94% |
| .tsv (split) | 60mb | 115% |
| .peak (split) | 58mb | 112% |

Compressed sizes:

| Format | File size | Percentage |
|---|---|---|
| .tsv.zst | 7.9mb | 15% |
| SQLite3.zst | 14mb | 27% |
| .peak.zst | 9.3mb | 18% |
| .tsv.zst (split) | 9.1mb | 17% |
| .peak.zst (split) | 11mb | 21% |

| Program | Core | Glue | App (HTML + JS) | Total |
|---|---|---|---|---|
| PeakSlab SQLite3 | 851kb | 391kb | 32kb | 1269kb |
| PeakSlab .peak | 37kb (4%) | 9kb (2%) | 39kb (122%) | 85kb (7%) |

The SQLite3 version is the old version of PeakSlab, before I wrote the custom file format. The advantages of the custom format are smaller file sizes, instant loading (cast to a struct), and versatile indexes. The reason .peak slabs are smaller than .tsv files is that peak removes all capitalization and HTML tags and puts them in a tags (or dictionary) section to be reinserted on render.
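The "cast to a struct" load is the core trick of the format. As a rough sketch of the idea only (the post doesn't show the real header layout, so every field in `PeakHeader` below is hypothetical):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical .peak header: the real field layout isn't documented in
 * the post, but the point is that the on-disk bytes ARE the in-memory
 * struct, so "parsing" is just a pointer cast. */
typedef struct {
    char     magic[4];      /* e.g. "PEAK" */
    uint32_t version;
    uint32_t entry_count;
    uint32_t primary_off;   /* byte offsets into this same buffer */
    uint32_t secondary_off;
    uint32_t tertiary_off;
} PeakHeader;

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f) return 1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    uint8_t *buf = malloc((size_t)size);
    if (!buf || fread(buf, 1, (size_t)size, f) != (size_t)size) return 1;
    fclose(f);

    /* No parsing step: the raw bytes are the struct. */
    const PeakHeader *h = (const PeakHeader *)buf;
    printf("%u entries, primary index at offset %u\n",
           h->entry_count, h->primary_off);

    free(buf);
    return 0;
}
```

Because the on-disk layout is the in-memory layout, load time is just the read (or decompress) plus a pointer cast; nothing is parsed or copied.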
As you can see, the runtime is drastically smaller, the files are smaller, and the load speed is faster even with decompressing the files on every load. Loading uncompressed files is 1.64x faster, or 2x faster if the files are split (even though the split files take up more space than the single file).

This project is under the GPL3 license. It uses the following libraries:

- Zstandard (BSD License. See ZSTD_LICENSE.txt)
- Stringzilla (Apache 2.0 License. See STRINGZILLA_LICENSE.txt)

Features:

- Client Side (Offline, power to the user)
- Modular (Can load and run many different dictionary files in parallel)
- Scalable (Same as above)
- Lightweight (Written from scratch)
- Fast (Loading and searching)
- Libre Open Source (GPL3)
- Simple (You just edit the source tsv file and then use peakgen to turn it into an indexed peak file. Or give it a full directory and it will generate a slab file with all the files in that folder.) Each line is already its own index item, but if you put an '@' anywhere, everything after it becomes an item in the secondary index; '^' does the same for the tertiary index. Duplicate the '@' or '^' to escape them (see the sketch after these lists). To load a peak file we literally just cast the raw data to a struct. Works great; this is why we write in C.
- Sane Defaults (Most relevant results first, fallback to less relevant)
- Powerful

The .peak format is a custom format built to be very fast to load (cast to a C struct and done) and very fast to search, with 3 binary search indexes built in. It's very similar to a TSV file and is generated from one. A peak file is not a database: there are no transactions, inserts, or writes, just reads, as it should be for ultimate speed and simplicity. Peak files can then be compressed with Zstandard, which is very quick to decompress and has a good compression ratio.

There's an online version of the PeakSlab Generator, because I hate when a dictionary converter stops working, or has 100 dependencies and you can't compile it any more without rewriting it. (It only works for peak files at the moment.) You want to quickly set up a new language? Add a language.

A slab file is like a peak file, except that instead of text the data is binaries. That allows for storing lots of little files with searchable headers that can be seamlessly integrated into the results of a search. Supported so far:

- tsv files
- WEBP images
- WEBM Opus audio
- JBIG2 images via a custom wasm decoder (27kb)

Adding support for other filetypes is trivial, but for right now I just have the most efficient and easy-to-use formats.

More features:

- System TTS integration
- Narrow and wide search
- Offline
- Selection Menu
- Online Peak Generator from .tsv source
- No API, no POSTs, not scrapable by AI, because it all runs clientside through javascript and wasm. (But if the AI were smart it'd just read the provided source tsv files.)
- No frameworks, no React, no npm, no jQuery, no typescript, just vanilla javascript and C code.
- Regex or Glob support
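As a hedged sketch of the '@'/'^' markers described under "Simple" above: the post says everything after an unescaped marker goes into the secondary or tertiary index, and a doubled marker is an escape. The helper below is my own illustration, not peakgen's actual code, and the real generator presumably also strips the markers from the stored entry:

```c
#include <stdio.h>

/* Return the text after the first unescaped marker `m`, or NULL.
 * A doubled marker ("@@" or "^^") counts as an escaped literal. */
static const char *find_marker(const char *s, char m) {
    for (const char *p = s; *p; p++) {
        if (*p == m) {
            if (p[1] == m) { p++; continue; }  /* doubled = escaped */
            return p + 1;                      /* unescaped marker hit */
        }
    }
    return NULL;
}

int main(void) {
    /* One source line: headword, secondary key after '@',
     * tertiary key after '^', and an escaped literal "@@". */
    const char *line = "hello@greeting^informal and email@@addresses";
    const char *secondary = find_marker(line, '@');
    const char *tertiary  = find_marker(line, '^');
    printf("primary  : %s\n", line);
    printf("secondary: %s\n", secondary ? secondary : "(none)");
    printf("tertiary : %s\n", tertiary ? tertiary : "(none)");
    return 0;
}
```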
TODO:

- Add a Codec2 audio decoder so we can get even smaller audio files.
- Add chunking and HTTP ranges to the service worker's download logic so we can resume and track big downloads.
- Add search parameters so that we can quickly share searches.
- Expand Exact Search to work with the 2nd index too.
- Make an online editor.
- Media support in dictionary.
- Ignore zero-width spaces in search.
- History and bookmarking.
- Selection to TTS.
- Sheet music (ABC files).
- Remove the javascript glue code for peak.wasm (peak.js).
- Cite sources.
- Allow users to upload their own custom PeakSlab files, which will stay cached in IndexedDB.
- Rework databases.
- Rewrite the AI's service worker; make it more modular to make porting languages and data easier.
- Rewrite the AI's rust code in C.
- Bundle the zstd compressor with peakgen.
- Make an online peakgen.
- Add .slab support to the online peakgen.
- Fix strcmp bugs.
- Fix the context menu.
- Make a custom regex-like language for substitution and character unfolding.
- Make it so that files included in a slab file can have a metadata file, so that attribution or alt text can be attached to the file.
- Make the combiner combine entries with the same headword in order.
- Enable custom HTML for the combining of dictionary entries.

Done:

- Added JBIG2 image support.
- Rewrote the interface to be more intuitive and simpler.
- Moved from SQLite's wasm backend to a brand new engine and file format. The format allows for really good compression, lightning-fast speed, and lazy loading.
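Several steps in the history below revolve around getting Zstandard decompression fast enough, and the .peak.zst path decompresses on every load. That step can be sketched with zstd's real one-shot API (zstd is a listed dependency; the wrapper name and abbreviated error handling are mine):

```c
#include <stdlib.h>
#include <zstd.h>

/* One-shot Zstandard decompression: the kind of call a .peak.zst loader
 * makes before casting the resulting buffer to the peak header struct. */
unsigned char *decompress_peak(const void *src, size_t src_size,
                               size_t *out_size) {
    unsigned long long raw = ZSTD_getFrameContentSize(src, src_size);
    if (raw == ZSTD_CONTENTSIZE_ERROR || raw == ZSTD_CONTENTSIZE_UNKNOWN)
        return NULL;  /* need the exact size from the frame header */

    unsigned char *dst = malloc((size_t)raw);
    if (!dst) return NULL;

    size_t n = ZSTD_decompress(dst, (size_t)raw, src, src_size);
    if (ZSTD_isError(n)) { free(dst); return NULL; }

    *out_size = n;
    return dst;  /* caller casts this buffer to the peak struct */
}
```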
History:

- Be me, a missionary in Cambodia. All the Khmer dictionary apps are full of ads, require an internet connection, or are just incomplete. So I decide to make my own Khmer dictionary modules for Aard. The process is messy and it's difficult to share with other people. There's no Aard dictionary app on iOS.
- Tried StarDict and other things; a lot of the programs were outdated and just didn't work anymore, so I decided to make my own.
- Tried SQLite; it worked pretty good, but the database files were too large and the runtime was too bloated. Editing databases was a pain. Left join, right join, all join? I figured out that GitHub Pages would send a compressed form if I saved the database file with a .html extension. Still downloads really slowly on iOS. Decided that I don't need all the features SQLite offers; I just need to be able to read from the database. Also wanted the ability to remove tags and such from search without having duplicated data.
- Started using Grok to help me prototype a lot of ideas.
- Tried PouchDB with javascript; too slow to load from a file.
- Tried rolling my own database in javascript; parsing was too slow, startup too slow.
- Tried using IndexedDB. It was good, but writing to IndexedDB is just too slow for the first run. Like, really slow.
- Tried decompressing database files using decompression streams and gzip compression; still slower than SQLite's loading of an uncompressed database.
- Switched to decompressing with a javascript zstd decompressor; speed was acceptable, but still slower than SQLite.
- Started using zstd wasm modules for decompression. Good, but transferring the memory from wasm to javascript was incurring a cost, or was impossible to implement right.
- Because Grok sucks at writing C wasm modules, I switched to Rust for the wasm backend. Suddenly had really good speed and kept all the major processing in one wasm module. Thought that a 150kb module was much better than the 1MB SQLite wasm module.
- Refactored everything to work with lazy loading and lazy searching to make the app more seamless and less inefficient.
- Rewrote everything from scratch in C, because I understand it better, it's faster, and most of my previous Rust code was unsafe code anyway. Rewrote the HTML and javascript too. Got the size of the peak decoder binary from 150kb in Rust down to 52kb in C.
- Had 800mb of sheet music I wanted to compress down, and looked around for JBIG2 support. I could turn each page into a tiny PDF, but PDFs don't open on mobile and I wanted them to show up just as an image with no controls or nonsense. Couldn't find any readily available JBIG2 decoders for javascript or wasm (other than the ones inside pdf.js and pdfium etc., but getting those to work with my code wasn't happening). I tried using pdf.js but it was slow, huge, and still ugly. So I had Claude AI guide me through adapting ghostscript's jbig2dec. I was gonna use libpng, but that made the wasm decoder 178kb, the largest part of PeakSlab yet. I didn't like that, so I had Claude write a new frontend to jbig2dec that does custom 1-bit PNG encoding from scratch. The wasm for that is down to 92kb and it works great.
- Changed some compile flags; got the core peak wasm module down to 37kb. Used Claude AI to remove jbig2.js for more space savings.
- Aggressively disabled code in the build of jbig2.wasm; got it down to 26kb.
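"Custom 1-bit PNG encoding" above means emitting PNG's 1-bit grayscale scanline layout directly. Here is a sketch of just the pixel-packing step, assuming an 8-bit black/white input bitmap; a real encoder like the jbig2dec frontend described above still has to add the PNG signature, the zlib stream, and the chunk CRCs, which are omitted here:

```c
#include <stdint.h>
#include <stdlib.h>

/* Pack an 8-bit black/white bitmap into PNG 1-bit grayscale scanlines:
 * each row starts with one filter byte (0 = None), then pixels are packed
 * 8 per byte, most significant bit first (0 = black, 1 = white). */
uint8_t *pack_1bit_scanlines(const uint8_t *pix, int w, int h,
                             size_t *out_len) {
    size_t row_bytes = ((size_t)w + 7) / 8;  /* packed pixels per row */
    size_t stride = 1 + row_bytes;           /* + filter byte */
    uint8_t *out = calloc((size_t)h, stride);
    if (!out) return NULL;

    for (int y = 0; y < h; y++) {
        uint8_t *row = out + (size_t)y * stride;
        row[0] = 0;  /* filter type: None */
        for (int x = 0; x < w; x++) {
            if (pix[(size_t)y * (size_t)w + (size_t)x] >= 128)  /* white */
                row[1 + x / 8] |= (uint8_t)(0x80 >> (x % 8));
        }
    }
    *out_len = (size_t)h * stride;
    return out;
}
```

In PNG terms this corresponds to color type 0 (grayscale) with bit depth 1, which is why a 1500-page scan collection can stay small enough to ship inside a slab file.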