Open source libraries for Text and Time Series

Ben Lorica 罗瑞卡

Aug 25, 2022

Data Exchange podcast

Unleashing the power of large language models: If you work with text, you should incorporate transformer-based language models into your NLP pipelines. You can either build your own tools or use libraries that come with pre-trained models. Maarten Grootendorst is the author of open source libraries that I’ve come to enjoy: BERTopic (topic modeling with transformers and c-TF-IDF), PolyFuzz (fuzzy string matching), and KeyBERT (keyword extraction). All three libraries have simple Python APIs and are well documented, and BERTopic also comes with several nice visualization tools.
Machine Learning for Time Series Intelligence: Aadyot Bhatnagar is a Senior Research Engineer at Salesforce and co-creator of Merlion, an open source framework for applying machine learning to time series data. Merlion supports a wide range of time series learning tasks, including forecasting, anomaly detection, and change point detection. I’ve long wanted (declarative) tools that make time series analysis and modeling more accessible to non-experts. New libraries like Nixtla, Kats, Merlion, and Greykite are steps in the right direction.
Building production-ready machine learning pipelines: Hamza Tahir and Adam Probst are co-creators of ZenML, an extensible open source framework for building reproducible pipelines.
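
For readers curious about the c-TF-IDF weighting mentioned in the BERTopic item above, here is a minimal pure-Python sketch. It assumes the class-based formula as I understand it from the BERTopic documentation: a term's L1-normalized frequency within a class, multiplied by log(1 + A / f), where A is the average number of words per class and f is the term's frequency across all classes. The function names (`c_tf_idf`, `top_terms`) and the toy data are hypothetical and only for illustration; this is not BERTopic's own code, which works on clustered document embeddings and sparse matrices.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_class):
    """Score terms per class; docs_per_class maps a label to a list of token lists."""
    # Treat all documents of a class as one concatenated bag of words.
    class_counts = {
        label: Counter(tok for doc in docs for tok in doc)
        for label, docs in docs_per_class.items()
    }
    # f: frequency of each term across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        class_size = sum(counts.values())
        scores[label] = {
            # L1-normalized term frequency times the class-based IDF term.
            term: (tf / class_size) * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
    return scores


def top_terms(scores, label, k=3):
    """Return the k highest-scoring terms for a class (ties broken alphabetically)."""
    ranked = sorted(scores[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy data (hypothetical): two "topics", each with two tokenized documents.
topics = {
    "sports": [["ball", "team", "goal"], ["team", "win", "the"]],
    "finance": [["bank", "rate", "the"], ["bank", "loan", "rate"]],
}
scores = c_tf_idf(topics)
print(top_terms(scores, "sports"))
print(top_terms(scores, "finance"))
```

Words that appear in every class (like "the" here) are pushed down relative to words that are frequent in only one class, which is what makes the top-ranked terms usable as topic descriptions.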

[Image: Window Front at Night from Wikimedia.]

Ray AI Runtime (AIR): A scalable and unified toolkit for ML applications

Officially announced at this week’s Ray Summit, AIR unifies Ray’s existing native ML libraries so that they work smoothly together and integrate easily with popular ML frameworks. AIR makes it easy to run machine learning workloads in just a few lines of Python code, leaving Ray to coordinate computations at scale.

Confidential Computing and Machine Learning

We assess the popularity of various Confidential Computing tools and explain why Confidential Computing can now be used for analytics and machine learning (both for model inference and model training).

Foundation Models: A non-technical primer

In this article, Kenn So of Shasta Ventures and I review a class of models that has had an impact on computer vision, text, and speech applications. We list implications for product builders, entrepreneurs, and investors.

The best data warehouse is a lakehouse

A short summary of Databricks SQL (DBSQL) initiatives pertaining to classic data warehousing, data transformation & ingest, connectivity, and other items that are redefining analytics on the lakehouse.

If you enjoyed this newsletter, please support our work by encouraging your friends and colleagues to subscribe.

Ben Lorica edits the Gradient Flow newsletter. He helps organize the Ray Summit, the NLP Summit, and the Data+AI Summit. He is the host of the Data Exchange podcast. You can follow him on Twitter @BigData. This newsletter is produced by Gradient Flow.
