If you're a data scientist and need to analyze loads of CSV files for insights into, say, stock-price and market movements, the Julia programming language trumps machine-learning rivals Python and R, according to Julia supporters.
Machine learning has propelled Python upwards to make it probably the most popular programming language among developers these days, along with Java and JavaScript.
However, Julia, a young language with roots in MIT's Computer Science and Artificial Intelligence Lab (CSAIL), has also become one to watch, having found a core audience among data scientists.
Julia is not among the top 10 programming languages that developers use but it is in the top 10 most-loved programming languages in this year's survey from Stack Overflow, putting it up there with Rust, TypeScript, Python, Kotlin, Go, Dart, C#, Swift, JavaScript and SQL.
Some languages, such as Rust, aren't widely used by developers but are appreciated by programmers for qualities that suit systems programming rather than application programming. For example, Microsoft is looking to Rust for the memory-safety features lacking in C and C++, which are extensively employed in Windows and other Microsoft projects.
Julia, on the other hand, has been adopted by some programmers for its C-like speed, but it has a much smaller ecosystem of packages than Python.
A recent update to Julia has improved multi-threading to offer more speed enhancements, and that's what Julia developers argue is giving it a sizable edge over Python and statistical programming language R at the task of parsing CSV files for data analysis.
According to Deepak Suresh, a machine-learning engineer at Julia Computing, multithreading gives Julia an advantage over both machine-learning rivals across a range of datasets read from CSV (comma-separated values) text files.
Suresh has benchmarked three CSV parsers, fread from R's data.table package, Pandas' read_csv for Python, and Julia's CSV.jl, and reckons that Julia comes out on top.
"Julia's CSV.jl is 1.5 to 5 times faster than Pandas even on a single core; with multithreading enabled, it is as fast or faster than R's fread," he notes.
The benchmarks were carried out on a machine with Ubuntu 18.04 powered by an Intel Xeon Silver 4114 processor running at 2.20GHz.
As he explains, Julia's CSV.jl is the only tool of the three that is "fully implemented in its higher-level language rather than being implemented in C and wrapped from R/Python".
The benchmarks are meant to demonstrate the speed of loading data in Julia and also indicate the performance of Julia code during data analysis.
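For readers who want to try a comparison of this kind themselves, the following is a minimal sketch of a load-time measurement in Python. It is not the harness Suresh used: the helper names (read_with_stdlib, time_reader, write_sample_csv), the column names and the tiny synthetic file are all assumptions made for illustration, and the standard-library csv module stands in for a real parser.

```python
import csv
import os
import tempfile
import time


def read_with_stdlib(path):
    """Parse a CSV file into a list of rows using only the standard library."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def time_reader(reader, path, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of reader(path)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)
        best = min(best, time.perf_counter() - start)
    return best


def write_sample_csv(path, rows=10_000):
    """Write a small synthetic five-column price file (all values are made up)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["row_id", "open", "high", "low", "close"])
        for i in range(rows):
            writer.writerow([i, 100.0 + i, 101.0 + i, 99.0 + i, 100.5 + i])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.csv")
        write_sample_csv(sample)
        elapsed = time_reader(read_with_stdlib, sample)
        print(f"best of 3 runs: {elapsed:.4f}s")
```

Any callable that takes a file path can be passed to time_reader, so with Pandas installed, pandas.read_csv could be timed the same way. Taking the best of several runs is one common way to reduce noise from disk caching and other processes; it is a design choice of this sketch, not something the benchmark is reported to have done.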
One of the example benchmarks looks at Apple stock prices (open, high, low and close) using a 2.5GB dataset with 50 million rows and five columns.
"The single threaded CSV.jl is about 1.5 times faster than R's fread from data.table. With multithreading CSV.jl is about 22 times faster. Pandas' read_csv takes 34s to read, this is slower than both R and Julia," Suresh declares.
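To put the one absolute figure quoted for this dataset in perspective, the short calculation below turns the reported 34-second Pandas read into throughput. It uses only numbers from the article and assumes 1GB means 10^9 bytes; the constant and function names are illustrative. No timings are derived for R or Julia because the article gives only ratios for them.

```python
# Figures reported for the Apple stock-price dataset.
DATASET_BYTES = 2.5e9        # 2.5GB, assuming 1GB = 10**9 bytes
DATASET_ROWS = 50_000_000    # 50 million rows
PANDAS_SECONDS = 34.0        # reported read_csv time


def throughput(amount, seconds):
    """Return amount processed per second."""
    return amount / seconds


mb_per_second = throughput(DATASET_BYTES, PANDAS_SECONDS) / 1e6
rows_per_second = throughput(DATASET_ROWS, PANDAS_SECONDS)
bytes_per_row = DATASET_BYTES / DATASET_ROWS

print(f"{mb_per_second:.1f} MB/s")          # 73.5 MB/s
print(f"{rows_per_second:,.0f} rows/s")     # 1,470,588 rows/s
print(f"{bytes_per_row:.0f} bytes per row") # 50 bytes per row
```

At roughly 50 bytes per row, the file is made up of many short records, which is the sort of workload where parsing cost is likely to matter more than raw disk speed.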
Another looks at performance with a mortgage risk dataset from Google-owned data-science platform Kaggle, a mixed-type dataset with 356,000 rows and 2,190 columns.
"Pandas takes 119s to read in this dataset. Single-threaded fread is about twice faster than CSV.jl. However, with more threads Julia is either as fast or slightly faster than R," says Suresh.
Another is the acquisition dataset from US mortgage lender Fannie Mae, which has four million rows and 25 columns.
"Single-threaded data.table is 1.25 times faster than CSV.jl. But, the performance of CSV.jl keeps increasing with more threads. CSV.jl gets about 4 times faster with multi-threading," he says.
Julia Computing says that, across all eight datasets it tested, Julia's CSV.jl is always faster than Pandas, and with multi-threading it is competitive with R's data.table.