I posted a little while ago about needing to convert Java classes into TypeScript declarations. The goal is to give Rhino JS super powers by using the TypeScript backend to expose, and understand, what's available to the JavaScript context.
The problem is, the application I'm trying to get TypeScript to understand is made up of some 300 jar archives.
You can unzip a jar, or use the `jar` command (`jar tf my-archive.jar`) to list its contents, then scan the output for `.class` entries. This is the first bottleneck: if 300 jars contain, let's say, 100 classes each, well, you can imagine — that's on the order of 30,000 classes.
There is some disk I/O going on here, but I'm not sure how expensive it is. I am spawning the command and awaiting a promise in a loop, so it runs one at a time, I suppose?
Could this be done better?
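Roughly what that loop looks like, as a minimal sketch (the jar paths and error handling are simplified, nothing here is exact to my code):

```typescript
import { spawn } from "child_process";

// Promise wrapper around spawn: resolve with stdout on a clean exit.
function run(cmd: string, args: string[]): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args);
    let out = "";
    child.stdout.on("data", (chunk) => (out += chunk));
    child.on("error", reject);
    child.on("close", (code) =>
      code === 0 ? resolve(out) : reject(new Error(`${cmd} exited ${code}`))
    );
  });
}

async function listClasses(jarPaths: string[]): Promise<string[]> {
  const classes: string[] = [];
  for (const jarPath of jarPaths) {
    // One process at a time: each jar waits for the previous listing.
    const listing = await run("jar", ["tf", jarPath]);
    classes.push(...listing.split("\n").filter((l) => l.endsWith(".class")));
  }
  return classes;
}
```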
Node can handle it, but the CPU on my 2020 MacBook Pro gets damn hot (nothing new there, but it's absolutely not what I want).
Then, for each class found in this parent loop, there's an inner loop that runs `javap` to disassemble the class into something that can be parsed into generated TypeScript a step down the line. This is SLOW, and even though we are using spawn it's still not ideal. There is yet more writing to disk here, as we dump the `javap` output to disk.
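The javap step, sketched under the same caveats (the `out/` dump directory and single-jar classpath are simplifications):

```typescript
import { execFile } from "child_process";
import { promisify } from "util";
import { writeFile } from "fs/promises";

const execFileAsync = promisify(execFile);

async function dumpJavap(jarPath: string, classEntry: string): Promise<void> {
  // e.g. "org/example/Foo.class" -> "org.example.Foo"
  const className = classEntry.replace(/\.class$/, "").replace(/\//g, ".");
  // -public limits output to the members we'd actually expose to JS.
  const { stdout } = await execFileAsync("javap", [
    "-public",
    "-classpath",
    jarPath,
    className,
  ]);
  // Dump to disk for the parser stage (assumes out/ already exists).
  await writeFile(`out/${className}.javap.txt`, stdout);
}
```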
Is Node.js right for this application? Could I use workers or multiple processes? Is the bottleneck `javap`, or maybe `jar`? Lots of confusion 😑
Top comments (9)
I'd suggest using something like the async library to have a maximum of n processes running at once - this should reduce thrashing. You'd use forEachLimit or something like that and create a promise awaiting the result of the external process. Try to balance the number of concurrent processes against the number of CPUs etc. (there's a minimal sketch of this pattern at the end of this thread).

Ah yes, that library came up a lot while digging around Stack Overflow; it used to be really popular. Anyway, you mentioned balancing: where I have 8 CPUs and 3xx processes, should I find a limit divisible by those numbers?
Yeah, I still use it occasionally when I hit those kinds of challenges. I'd go with a limit of 8 or 16 and see which works out best.
Yeah, it seems to run slower now, but it's also more stable, so it's a trade-off I'm happy with. Thanks for the tips!
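For reference, a minimal sketch of the eachLimit pattern suggested at the top of this thread, assuming the async library (v3) and the jar-listing step from the post:

```typescript
import { eachLimit } from "async";
import { execFile } from "child_process";
import { promisify } from "util";

const execFileAsync = promisify(execFile);

async function listAllClasses(
  jarPaths: string[],
  limit = 8 // at most `limit` jar listings run at once, not all 300
): Promise<Map<string, string[]>> {
  const classesByJar = new Map<string, string[]>();
  // async v3 accepts an async-function iteratee and returns a promise
  // when no final callback is given.
  await eachLimit(jarPaths, limit, async (jarPath: string) => {
    const { stdout } = await execFileAsync("jar", ["tf", jarPath]);
    classesByJar.set(
      jarPath,
      stdout.split("\n").filter((e) => e.endsWith(".class"))
    );
  });
  return classesByJar;
}
```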
Seems the bottleneck is hardware, so the solution should be hardware.
Would it be possible (and practical) to fetch and write files from, for example, S3 buckets, and then scale the processing using services such as Lambdas or EC2 instances running a tiny Node app? Like below:

Your main app could parse the file list and distribute jobs, one file at a time, to a scalable "parser endpoint" that gets an S3 object path, converts the object, and puts the result in another bucket. This parser service would then scale with the load on the endpoint.

You could probably write an MVP or POC with like 5 jar files.
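A hypothetical sketch of that parser endpoint as a Lambda handler, assuming AWS SDK v3; the event shape, bucket names, and convertToDeclarations() are made up for illustration (running javap inside Lambda would also need a Java runtime layer):

```typescript
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

export async function handler(event: { bucket: string; key: string }) {
  // Fetch one jar from the input bucket...
  const obj = await s3.send(
    new GetObjectCommand({ Bucket: event.bucket, Key: event.key })
  );
  const jarBytes = Buffer.from(await obj.Body!.transformToByteArray());

  // ...convert it (stubbed here)...
  const dts = await convertToDeclarations(jarBytes);

  // ...and drop the generated declarations in the output bucket.
  await s3.send(
    new PutObjectCommand({
      Bucket: process.env.OUTPUT_BUCKET,
      Key: event.key.replace(/\.jar$/, ".d.ts"),
      Body: dts,
    })
  );
  return { ok: true, key: event.key };
}

// Stand-in for the real jar -> .d.ts pipeline.
async function convertToDeclarations(jar: Buffer): Promise<string> {
  throw new Error("not implemented in this sketch");
}
```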
Looks like I'm going to need to learn some GCP; we don't use AWS where I work 😑 (not that GCP is bad).
It sounds reasonable to build an app like this. Maybe I can get the whole thing running on my crappy 16 logical cores 😅 (I'm from the dual-core era, so anything above 4 sounds outstanding), and then, once I understand how my so-far-cobbled-together Node app works, take that and port it out to GCP. It's a great suggestion, actually. I did wonder how far optimizing my code would go, but as I said in the other threads, it's now slower but more stable, because, as you correctly point out, hardware is the bottleneck here. Still, what a learning experience this is.
Your main app would then, as was suggested, await the async AJAX calls: fire away ~20 concurrent calls, then wait until one is done before firing the next. It should be a fairly simple loop.
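That loop might look something like this (a minimal sketch assuming Node 18+ for global fetch; the endpoint URL is made up):

```typescript
async function dispatchJobs(keys: string[], limit = 20): Promise<void> {
  const inFlight = new Set<Promise<void>>();
  for (const key of keys) {
    const job: Promise<void> = fetch("https://parser.example.com/parse", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ key }),
    })
      .then(() => undefined)
      .catch((err) => console.error("job failed:", key, err)) // keep the loop alive
      .finally(() => inFlight.delete(job)); // free this slot when settled
    inFlight.add(job);
    if (inFlight.size >= limit) {
      await Promise.race(inFlight); // wait until at least one call is done
    }
  }
  await Promise.all(inFlight); // drain the stragglers
}
```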
You could use multiple processes, but I suggest a simpler solution: boost your thread pool size and use parallel scheduling with scheduling queues to schedule more than one task. Just remember to keep the number of spawned tasks under the number of threads in the thread pool.
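One reading of this suggestion, sketched (note that libuv's thread pool covers fs/zlib/crypto work rather than spawned processes, so the queue is doing most of the work here):

```typescript
// UV_THREADPOOL_SIZE defaults to 4 and must be set before the pool is
// first used; in practice you'd set it in the environment at launch.
process.env.UV_THREADPOOL_SIZE ??= "16";

const THREADS = Number(process.env.UV_THREADPOOL_SIZE);

async function runQueue<T>(jobs: Array<() => Promise<T>>): Promise<T[]> {
  const queue = [...jobs];
  const results: T[] = [];
  // One "lane" per thread; each lane pulls the next job when it's free,
  // so no more than THREADS tasks are ever in flight.
  const lanes = Array.from({ length: THREADS }, async () => {
    for (let job = queue.shift(); job; job = queue.shift()) {
      results.push(await job());
    }
  });
  await Promise.all(lanes);
  return results;
}
```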
It's a shame I didn't see this comment sooner; I went the fork-process route (one per jar), so it's now 300 processes for the allocated 'jobs'. It's not sequential, though: some jobs finish sooner than others, and when a child process does, we pass on to the next job.

The multi-process version is a hell of a lot faster, but running it now kind of stalls and lags my system... One of the jobs is the classpath processing, and that could have quite a lot of org.paths.goo paths, so I think instead of multi-process here, maybe a queue to limit the amount of processing within that job? Your suggestions I will need to research 🦉
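A capped version of what I described might look like this (a minimal sketch; jar-worker.js and its job hand-off are made up, not my actual setup):

```typescript
import { fork } from "child_process";
import * as os from "os";

function processJars(
  jarPaths: string[],
  poolSize = os.cpus().length // cap children at the core count, not 300
): Promise<void> {
  const queue = [...jarPaths];
  return new Promise((resolve, reject) => {
    let active = 0;
    const startNext = (): void => {
      const jarPath = queue.shift();
      if (!jarPath) {
        if (active === 0) resolve(); // queue drained and every child done
        return;
      }
      active++;
      const child = fork("./jar-worker.js", [jarPath]);
      child.on("exit", (code) => {
        active--;
        if (code !== 0) return reject(new Error(`worker failed on ${jarPath}`));
        startNext(); // slot freed: hand out the next jar
      });
    };
    for (let i = 0; i < poolSize; i++) startNext(); // fill the pool
  });
}
```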