PyData NYC 2024

Adopting Open-Source Tools for Time Series Forecasting

Speakers: Udisha Dutta Chowdhury, Abishek Murthy

The projector appeared not to be working, so I had to follow along with the slides on my phone.

The talk compared four libraries for time series forecasting:

sktime
skforecast
Darts
NeuralProphet

When comparing open-source libraries, these are the things to look at:

Data understanding: How well does it support exploratory data analysis (EDA)?
Data preparation: How well does it handle quality issues such as missing values, duplicates, and NaNs, as well as exogenous variables?
Backtesting
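
For reference, a minimal sketch of what backtesting means here: refit on a growing training window and score each forecast against the values that were held out. This is plain Python rather than the API of any of the four libraries, and the function names, the naive "repeat the last value" forecaster, and the numbers are all made up for illustration.

```python
from statistics import mean


def naive_forecast(history, horizon):
    """Hypothetical baseline: repeat the last observed value."""
    return [history[-1]] * horizon


def expanding_window_backtest(series, initial_train_size, horizon, forecaster):
    """Score a forecaster fold by fold with mean absolute error (MAE)."""
    fold_errors = []
    cutoff = initial_train_size
    while cutoff + horizon <= len(series):
        train = series[:cutoff]                   # everything seen so far
        actual = series[cutoff:cutoff + horizon]  # held-out future values
        predicted = forecaster(train, horizon)
        fold_errors.append(mean(abs(a - p) for a, p in zip(actual, predicted)))
        cutoff += horizon                         # grow the training window
    return fold_errors


# Made-up series, only to exercise the loop.
series = [10, 12, 13, 12, 15, 16, 18, 17, 19, 21]
errors = expanding_window_backtest(
    series, initial_train_size=4, horizon=2, forecaster=naive_forecast
)
print(errors)
```

The libraries in the talk wrap this kind of loop in their own splitters and evaluation helpers, so in practice you would use those rather than hand-rolling it.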

The speakers seemed to gravitate towards sktime and had very positive things to say about it.

How many dataframe libraries do you need to change a lightbulb?

Speakers: Dharhas Pothina (Quansight)

Mostly focused on pandas, Polars, and DuckDB.

Polars and DuckDB do not use indices, which simplifies the API.

Polars has simple data types and it uses Arrow data types under the hood.

Pandas supports geospatial data and complex numbers, which Polars doesn’t support.

Polars and DuckDB are both single-node, support lazy execution, and do in-process OLAP (online analytical processing).

Polars performs better on window operations, but DuckDB scales extremely well. Polars has sorted flags, which are remembered for the rest of the operations, and it also offers eager execution, whereas DuckDB is lazy only.

Main takeaway was to use these in new projects, and if you like SQL then use DuckDB, but if you don’t then Polars might serve you better as it’s more similar to pandas and provides a more expressive language. The example given for the more expressive language was to show how many lines it took to find the most expensive lightbulb per lightbulb manufacturer.
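
A rough sketch of the SQL flavour of that lightbulb question. I didn't capture the code from the slides, so the table, column names, and rows below are invented, and the standard-library sqlite3 module stands in for DuckDB so the snippet runs without extra installs. The query itself is plain SQL.

```python
import sqlite3

# Hypothetical data: (manufacturer, model, price). Not from the talk.
rows = [
    ("Acme", "A19 Basic", 2.5),
    ("Acme", "A60 Pro", 7.5),
    ("Lumo", "L-Mini", 4.0),
    ("Lumo", "L-Max", 9.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lightbulbs (manufacturer TEXT, model TEXT, price REAL)")
conn.executemany("INSERT INTO lightbulbs VALUES (?, ?, ?)", rows)

# Most expensive lightbulb per manufacturer: find each manufacturer's
# maximum price, then join back to recover the matching row.
query = """
SELECT l.manufacturer, l.model, l.price
FROM lightbulbs AS l
JOIN (
    SELECT manufacturer, MAX(price) AS max_price
    FROM lightbulbs
    GROUP BY manufacturer
) AS m
  ON l.manufacturer = m.manufacturer AND l.price = m.max_price
ORDER BY l.manufacturer
"""
most_expensive = conn.execute(query).fetchall()
conn.close()
print(most_expensive)
```

The point of the comparison in the talk was how many lines each library needs for this; the join-back pattern above is one of several ways to write it in SQL.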

Narwhals, by Marco Gorelli (Quansight, and the source of a lot of the information in this talk), exists so that library maintainers can define their DataFrame logic once.

Reproducible work environments for data scientists using Nix

Speakers: Avik Basu (Staff Data Scientist, Intuit)

Held in the Music Box (as in the morning, the projector was not running).

One of the interesting things about Nix that I hadn’t thought about, but which is integral to its reproducible builds, is having functions without side effects.
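
A loose Python analogy for why that matters (this is not how Nix actually computes anything, and every name below is hypothetical): if a build is a pure function of its declared inputs, then hashing those inputs is enough to identify the result, so the same inputs always map to the same output.

```python
import hashlib
import json


def build_key(name, inputs):
    """Derive a deterministic key from a build's name and declared inputs.

    Illustrative only: it reads no clock, environment variable, or network
    resource, so the key depends on nothing but its arguments.
    """
    payload = json.dumps({"name": name, "inputs": inputs}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{digest[:16]}-{name}"


# Made-up package pins for the example.
env_a = build_key("ds-env", {"python": "3.11", "numpy": "1.26.4"})
env_b = build_key("ds-env", {"numpy": "1.26.4", "python": "3.11"})
env_c = build_key("ds-env", {"python": "3.11", "numpy": "2.0.0"})

print(env_a == env_b)  # same inputs, same key
print(env_a == env_c)  # a changed input gives a different key
```

Anything with side effects (reading the current time, downloading "latest") would break that guarantee, which is presumably why the purity is so central.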

Nix’s shellHook acts like a constructor: it runs when you enter the shell.

Pushing Cython to its Limits in Scikit-learn

Speakers: Thomas J. Fan (scikit-learn maintainer)

Performance Uplift: HistGradientBoosting

Use Cython because it’s kinda like Python, and it reduces memory usage by doing fewer validations [TODO]

Developing: cython --annotate simple.pyx - “Remove yellow for code to go faster”

Tip: Jupyter Notebook users can use %%cython --annotate and keep re-running the cell until the yellow lines go away.

Directives that affect memoryview access, such as @cython.boundscheck, are available.

Fused types are interesting: they let you define one type that can stand for any of a set of types, and at compile time Cython generates N versions of the function, where N is the number of types in the fused type. This scales combinatorially, i.e. N x M versions if the function uses two fused types with N and M types respectively.
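
To make the N x M growth concrete, here is a small Python sketch (not Cython itself) that just enumerates the combinations a compiler would have to specialise for. The two fused types and their member types are hypothetical examples, not taken from the scikit-learn code shown in the talk.

```python
from itertools import product

# Hypothetical fused types, written as tuples of C type names.
floating = ("float32", "float64")       # N = 2
integral = ("int32", "int64", "intp")   # M = 3


def specializations(*fused_types):
    """List every combination of concrete types across the fused types."""
    return list(product(*fused_types))


variants = specializations(floating, integral)
for variant in variants:
    print(variant)
print(len(variants), "specialised functions")
```

As far as I know, reusing the same fused type for several arguments does not multiply the count, because all of those arguments specialise to the same concrete type; the blow-up only comes from distinct fused types.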

Temporal Analysis on Topics Utilizing Word2Vec

Speakers: Vishesh Narayan Gupta, Angad Sandhu, Faizan Wajid

Problem statement: there is a large volume of health reports for nursing homes, and they wanted to make that information easier to consume via some kind of dashboard.

The traditional approach would have been topic modelling and running stochastic counting, but this wouldn’t allow them to see the “semantic” evolution of words.

An interesting demo showed “cdc” and “unofficial”: throughout 2020 the distance between the two shrank, meaning they became more closely related semantically, which is an example of how disinformation can propagate.
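
A minimal sketch of the kind of comparison this implies, assuming one set of word vectors per time period and cosine similarity as the closeness measure (the talk may well have used a different metric). The two-dimensional vectors and period labels are invented toy values, not results from the talk.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings per period; real Word2Vec vectors have many more dimensions.
embeddings_by_period = {
    "period_1": {"cdc": [1.0, 0.0], "unofficial": [0.0, 1.0]},
    "period_2": {"cdc": [1.0, 0.0], "unofficial": [1.0, 1.0]},
    "period_3": {"cdc": [1.0, 0.0], "unofficial": [1.0, 0.2]},
}

# Compare the two words within each period's own vector space.
similarities = [
    cosine_similarity(vectors["cdc"], vectors["unofficial"])
    for vectors in embeddings_by_period.values()
]
for period, similarity in zip(embeddings_by_period, similarities):
    print(period, round(similarity, 3))
```

Comparing two words inside the same period's model sidesteps the problem that separately trained models do not share a coordinate system.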

Saving Sharks… with Python, Causal Inference and Bayesian Stats!

Speakers: Alexandre Andorra (LBS)

Guest was Alex Andorra

Alex mentioned failing their first-year statistics course in the department he now manages(?). A funny tidbit, but it seems to have been mentioned as a point about measuring signal-to-noise ratio rather than, perhaps, about measuring aptitude.

ArcticDB, the OLAP antidote

Speakers: James Munro

Starts off with the motivation: a query in BigQuery (BQ) that processes 0.5 TB of data and returns 200K rows takes ~15 seconds and costs ~$2.5-$3. Argues that this is slow and costly if you’re just looking to get back some rows of data with some filtering.
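
A quick sanity check on that cost figure, assuming on-demand pricing billed per terabyte scanned. The two rates below are my assumption (roughly the older and newer list prices as I remember them), not numbers from the talk, so check the current pricing page before relying on them.

```python
def on_demand_query_cost(terabytes_scanned, usd_per_terabyte):
    """Cost of one query when billing is proportional to data scanned."""
    return terabytes_scanned * usd_per_terabyte


TERABYTES_SCANNED = 0.5  # from the talk's example

# Assumed rates in USD per terabyte; not taken from the talk.
low_estimate = on_demand_query_cost(TERABYTES_SCANNED, 5.00)
high_estimate = on_demand_query_cost(TERABYTES_SCANNED, 6.25)

print(f"${low_estimate:.2f} to ${high_estimate:.2f} per query")
```

Under those assumed rates the range lines up with the ~$2.5-$3 quoted, and it makes clear that the cost is driven by the 0.5 TB scanned rather than the 200K rows returned.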

Daunting to Doable: How Institutions can encourage Open Source Contributions

Speakers: Kaushik Srinivasan (Bloomberg)

Predominantly a Bloomberg sales pitch from ~20 minutes in. Make that the whole presentation.

Can You Tell the Truth by Lying?

Speakers: James Powell

Grouped Kurtosis

A key part of pandas is its index alignment, and the fact that there is always an index.
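
A small sketch of what index alignment means in practice, assuming pandas is installed. The Series names and values are made up and are not the example from the talk.

```python
import pandas as pd

# Two Series with the same labels in a different order (made-up values).
left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([30.0, 20.0, 10.0], index=["c", "b", "a"])

# Arithmetic matches rows by index label, not by position.
aligned_sum = left + right

# Labels present on only one side come out as NaN after alignment.
partial_sum = left + pd.Series([100.0], index=["a"])

print(aligned_sum)
print(partial_sum)
```

Even a Series created without an explicit index gets a default integer one, which is the "always having an index" part.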
