
Dr. GP Pulipaka


Gigatron: A pure bleeding edge monorepo for enterprise machine learning development

Data science can be defined as the convergence of computer science, programming, mathematical modeling, data analytics, academic expertise, traditional AI research, and the application of statistical techniques through scientific programming tools such as Python, R, TensorFlow, and Java, on an ecosystem of SQL, NoSQL, and graph databases, streaming computing platforms such as Apache Spark, Apache Kafka, Apache Storm, Apache NiFi, Apache Flink, and Apache Geode, and linked data, to extract new knowledge from data patterns and provide new insights from distributed computing platforms amid the tsunami of big data. Though it is often possible to define statistical language models, it is difficult to implement them in object-oriented programming languages. Therefore, it is critical to wear the hats of both an advanced programmer and an infrastructure architect: to provide web-scale performance with in-memory computing, and to combine traditional research with machine learning and deep learning algorithms to create novel architectures unique to each enterprise, avoiding a one-size-fits-all approach. Real-time analysis is all the rage in the data science industry, and leveraging in-memory computing ecosystems can deliver faster execution results to corporations.

In 1944, Fremont Rider authored the book "The Scholar and the Future of the Research Library: A Problem and Its Solution," a discussion of the growth of American university libraries driven by the growth of the available data in the world. Rider estimated that these libraries would double in size every 16 years, that the Yale University library would reach 200 million volumes by 2040, and that its shelves would then occupy 6,000 miles of length. The term database was coined, as Sebastian and Coleman asserted, "to capture the sense that the information stored in a computer could be conceptualized, structured, and manipulated independently of the particular machine on which it resided." Two popular database models then revolutionized the commercial use of database applications: the network model of the Conference on Data Systems Languages (CODASYL) and the hierarchical model of the Information Management System (IMS).
Price proposed the law of exponential increase: according to his prediction, scientific knowledge would double every 15 years, and every half-century would see a tenfold increase in knowledge, which would grow the number of scientific journals and papers by an equivalent factor.

In 1970, Codd published "A Relational Model of Data for Large Shared Data Banks," which marked the start of relational database models. Codd defined the logical database schema, laying the cornerstone of the relational database model for the world. In 1974, the University of California at Berkeley funded the relational database product Ingres, which used the Query Language (QUEL), while IBM funded the relational database product System R with the Structured English Query Language (SEQUEL). System R subsequently spawned competition in the form of Microsoft's SQL Server (MSSQL) as well as Oracle and Sybase products. During this time, the relational database management system (RDBMS) became widely recognized. In the 1980s, the structured query language (SQL) became the standard query language for a large number of relational database management products, and IBM introduced Database 2 (DB2) as its robust relational database management offering. In the 1990s, the object database management system (ODBMS) was created. Subsequently, a golden digital era began with the advent of the Internet in the mid-1990s; data grew exponentially as the variety of data types in online transactions increased. In 1998, Mashey, a chief scientist at Silicon Graphics Inc. (SGI), presented a paper entitled "Big Data and the Next Wave of InfraStress" at a Unix users' group (USENIX) meeting, in which he discussed different data formats such as audio and video, the exponentially growing demands data places on physical storage systems, and the need to extract data per user expectations.

The term data science was first coined in 1999 by the mathematician William S. Cleveland at a meeting of the International Statistical Institute, and a subsequent paper was published by Cleveland in 2001, "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics." In that paper, the definition extended to (a) statistical theory; (b) statistical models; (c) statistical and machine learning methods; (d) algorithms for statistical and machine learning methods, and optimization; (e) systems combining computer science and programming for data analysis; and (f) live analyses of data in which the results are judged by the findings rather than by the scientific methodology and computer systems used. SQL remained highly popular for decades for extracting data for complex programmatic analysis.
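As a quick back-of-the-envelope check (my own arithmetic, not Price's, assuming simple exponential growth), doubling every 15 years compounds over a half-century to

$$2^{50/15} = 2^{3.\overline{3}} \approx 10.1,$$

so Price's two figures, a doubling every 15 years and a tenfold increase every 50 years, are the same exponential law stated over different horizons.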

Data science applies to a number of industries and is not restricted to any one of them. Some people work in the same industry constantly, for example the utilities industry, and become experts who know which machine learning models need to be built and how to apply those models to that industry. To many statisticians, the whole term data science appears patronizing. I am not trying to discredit the field of statistics or statisticians. However, it is vitally important to remember that statistics is only a branch of the greater field of data science; it is not data science by itself. 95% of the people with the title of data scientist are not data scientists; many of them are statisticians. It is not about writing theoretical statistical equations for operations research; it is about applying them to a particular branch of machine learning, for example reinforcement learning, with tools such as PyTorch, TensorFlow, Python, or R. Data science is not pure statistics: I cannot bring onto my team a data scientist who knows what the Naïve Bayes algorithm is but has no clue how to apply it in Python or R to solve a particular problem. Probabilistic inference is great for understanding causality in an industry, but at least familiarity with AutoML would help data scientists with no programming background succeed. Middle-of-the-road data scientists are data analysts or statisticians, not data scientists. This is where computer science takes precedence: whatever statistical theory you bring, you should be able to apply it in a program through computer science, or it remains a theory with no reproducibility of results.

I have worked for Fortune 100 corporations in a broad range of industries such as aerospace, manufacturing, semiconductors, AFS (apparel and footwear solutions), media and entertainment, automotive, CCS (customer care services for utilities), energy, retail, high tech, life sciences, chemicals, banking, services, artificial intelligence, hardware, FPGA design, and CPG (consumer packaged goods). Having implemented 30 consulting projects over a span of 20+ years, I understand and apply master data management and data governance strategies, analytics, business intelligence, programming, machine learning models, and statistics, particularly with a framework of tools to design, develop, and deploy machine learning and deep learning applications that solve real-world problems with reinforcement learning algorithms in natural language processing, speech recognition, text-to-speech, chatbots, and speech-to-text analytics in PyTorch, Python, TensorFlow, and R.
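To make the Naïve Bayes point concrete, here is a minimal sketch of applying it in Python with scikit-learn; the bundled Iris dataset stands in for real business data, and the 25% test split is an arbitrary illustrative choice:

```python
# Minimal sketch: Gaussian Naive Bayes applied in Python via scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load a toy dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the model: estimates class priors and per-class Gaussian likelihoods.
model = GaussianNB()
model.fit(X_train, y_train)

# Score on held-out data to check that the model generalizes.
predictions = model.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, predictions):.3f}")
```

Knowing the Bayes rule behind GaussianNB is the statistics half of the job; being able to run, evaluate, and ship a script like this is the computer science half.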

Data science is no longer merely an industry term. There are practical exams, research papers, and dissertations in academia. Most PhD candidates with a data science and machine learning background are absorbed into companies right after completion through campus recruitment. In some cases, PhD candidates turn into entrepreneurs with an inventive idea, running data science organizations that fuel the industry with innovative products. Larry Page, as part of his computer science PhD program, developed the PageRank algorithm on BackRub, a web crawl of 10 million documents; BackRub was built in Python and Java running on Linux. This academic project turned into Google and subsequently Alphabet Inc., with equity of $177 billion and revenue of $136 billion for 2018. The original results produced by PhD candidates in the fields of computer science, data science, robotics, big data analytics, and statistics lay a solid foundation for future applications in data science. I have seen a number of trade shows and conferences where some data scientists bring presentations full of cartoons and present them at length; my suggestion is not to be AI-washed by such marketing and advertising materials. The machine intelligence of algorithms is now distributed in cloud-computing environments and will aid organizations in discovering valuable insights and performing operations through APIs. Organizations are mass-manufacturing algorithms, since doing so meets economies of scale in a distributed environment. Artificial intelligence is the new inferno burning away the AI winter (which lasted from the 1990s through the 2010s), with machine intelligence platforms that use deep learning to prototype rapidly and deploy from sandboxes into production.

A number of open-source machine learning and deep learning platforms have been released in recent times: TensorFlow by Google; Caffe by UC Berkeley; NLTK (Natural Language Toolkit), which originated at the University of Pennsylvania, for natural language processing; scikit-learn, a machine learning library for Python; a number of R packages for deep learning and machine learning; Theano, a numerical computation library for Python; and Torch, a platform for developing machine learning and deep learning with an underlying C implementation. PyTorch, however, which can serve as a replacement for the NumPy package in Python with built-in GPU support for tensor computations, has taken artificial intelligence to the next level.
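Here is a minimal sketch of that NumPy-replacement idea in PyTorch; the matrix sizes are arbitrary illustrations, and the code falls back to the CPU when no GPU is present:

```python
# Minimal sketch: PyTorch tensors as a GPU-capable NumPy replacement.
import numpy as np
import torch

# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# NumPy-like tensor creation, allocated directly on the chosen device.
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)

# Matrix multiplication runs on whichever device holds the tensors.
c = a @ b
print(c.mean().item())  # .item() copies the scalar result back to Python

# Interoperate with the existing NumPy ecosystem: move to CPU, then convert.
n = c.cpu().numpy()
assert isinstance(n, np.ndarray)
```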

Developing large-scale web applications requires full-stack skills, portability, maintainability, and the ability to build complex architectures to solve complex problems. Full-stack web development draws on data science skills such as devising the architecture, planning for scalability, and implementing it through programming in Java, JavaScript, ReactJS, VueJS, TensorFlow.js, GoLang, and varying kinds of JavaScript applications. TensorFlow.js, JavaScript, and GoLang have full-scale availability of machine learning algorithms to apply in the field of web development. A full-stack data scientist breaks the silos in the organization and works across multiple teams and disciplines of data science to implement the architecture for either web or mobile development. Therefore, data science is a convergence of several disciplines with a number of frameworks and tools such as SQL, NoSQL, Python, PyTorch, R, Java, and TensorFlow, and the ability to implement across distributed computing, cloud computing, and supercomputing applications.

In web development and software engineering, a monorepo is a software development strategy of maintaining a single large-scale code repository that consists of a number of projects which may or may not be correlated with each other. There are significant advantages to maintaining such monorepos in web development: teams work collaboratively, and data scientists can refactor and reuse code across projects. Companies that deploy project repositories individually run into dependency problems during code deployment; a monorepo lets data scientists avoid such dependency issues with atomic commits and shared dependencies across projects. Companies such as Google and Twitter implement monorepos successfully for AI, machine learning, and web development. Gigatron is a good boilerplate that a number of organizations can adopt for their next full-stack monorepo project; a sketch of what such a layout might look like follows.
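To make the monorepo idea concrete, here is a minimal sketch of a Yarn-workspaces-style root package.json; the workspace names and scripts are hypothetical illustrations, not Gigatron's actual layout:

```json
{
  "name": "fullstack-monorepo",
  "private": true,
  "workspaces": [
    "packages/web",
    "packages/mobile",
    "packages/server",
    "packages/shared"
  ],
  "scripts": {
    "build": "yarn workspaces run build",
    "test": "yarn workspaces run test"
  }
}
```

With shared dependencies hoisted to the repository root, a change to packages/shared and every consumer of it can land in a single atomic commit, which is exactly the dependency advantage described above.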

About the Author

Dr. Ganapathi Pulipaka is a Chief Data Scientist at Accenture for AI strategy, architecture, and application development of machine learning and deep learning algorithms, with experience in deep reinforcement learning algorithms, IoT platforms, Python, R, TensorFlow, big data, IaaS, data science, blockchain, Apache Hadoop, Apache Kafka, Apache Spark, Apache Storm, Apache Flink, SQL, NoSQL, mathematics, data mining, statistical frameworks, and SIEM with SAP Cloud Platform Integration, AWS, Azure, and GCP; 20+ years of experience as an SAP technical development and integration lead with 30 project implementations for Fortune 100 companies; and 9+ years of AI research and development experience.

Dr. Ganapathi Pulipaka, Postdoc
Chief Data Scientist, DeepSingularity LLC
Computer Science Engineering
Winner of Top 50 Tech Awards in AI, Machine Learning, and Data Science

You can get in touch with him on the following social media channels:
LinkedIn: https://www.linkedin.com/in/dr-ganapathi-pulipaka-56417a2/
Twitter: https://twitter.com/gp_pulipaka
Facebook: https://www.facebook.com/ganapathipulipaka/
GitHub: https://github.com/GPSingularity
Website 1: www.gppulipaka.org
Website 2: www.deepsingularity.io
Email: Ganapathi.Pulipaka@deepsingularity.io
Phone: 323-898-7112

References

Cleveland, W. S. (2001). Data science: An action plan for expanding the technical areas of the field of statistics. Retrieved from http://www.stat.purdue.edu/~wsc/William.S.Cleveland.pdf
Demchenko, Y., Belloum, A., Laat, C. D., Loomis, C., Wiktorski, T., & Speckshoor, E. (2017, December 11). Customizable data science educational environment: From competences management and curriculum design to virtual labs on-demand. IEEE International Conference on Cloud Computing Technology and Science (CloudCom). http://dx.doi.org/10.1109/CloudCom.2017.59
FotonTech (2019). The best boilerplate for your Monorepo Fullstack projects. Retrieved from https://github.com/FotonTech/gigatron
Pulipaka, G. (2015). Big data appliances for in-memory computing: A real-world research guide for corporations to tame and wrangle their data (2nd ed.). California: High Performance Institute of Technology.
Saltz, J. S., & Grady, N. W. (2017, December 14). The ambiguity of data science team roles and the need for a data science workforce framework. IEEE International Conference on Big Data (Big Data). http://dx.doi.org/10.1109/BigData.2017.8258190
