Dataflow Programming
- Summary
-
Discussion
- Could you explain dataflow programming with an example?
- What are the main benefits of dataflow programming?
- What are the main characteristics of dataflow programming?
- What are some traits expected of dataflow languages?
- Could you list some dataflow languages?
- What other programming paradigms are closely related to dataflow programming?
- What are some shortcomings or challenges with dataflow programming?
- Milestones
- References
- Further Reading
- Article Stats
- Cite As
Dataflow programming (DFP) is a programming paradigm where program execution is conceptualized as data flowing through a series of operations or transformations. Each operation may be represented as a node in a graph. Nodes are connected by directed arcs through which data flows. A node performs its operation when its input data are available. It sends out the result on the output arcs.
Due to social media (2000s) and IoT (2010s), there's a need to analyse data quickly and derive timely insights. There's also a need to explore and exploit parallel architectures. Dataflow programming aids parallelization without the added complexity that comes with traditional programming.
Dataflow programming includes two sub-fields: Flow-Based Programming (FBP) and Reactive Programming. Dataflow programs can be created in text or visually. There are many languages that support this paradigm.
Discussion
-
Could you explain dataflow programming with an example? Consider the simple program shown in the figure. In a traditional von Neumann architecture, these statements would be executed sequentially. Thus, the program would execute in three units of time.
In dataflow programming, we represent the program as a dataflow graph that has three inputs (X, Y, 10), three operations (+, /, *) and one output (C). This is a directed graph. Nodes represent instructions or operations. Arcs represent variables. Equivalently, arcs represent data dependencies between instructions. The value in Y is also duplicated since it's sent to two instructions.
In dataflow execution model, this program takes only two units of time. This is because the addition and division instructions can execute in parallel. For each of these instructions, their respective inputs are available at the start. They have no dependencies on each other and therefore can be parallelized.
A factory assembly line offers an analogy. Just as raw materials and semi-finished products move on a conveyor belt from one station to the next, data in DFP moves from one node to the next.
-
What are the main benefits of dataflow programming? With the rise multi-core architectures, there's a need to exploit these via parallel execution. Von Neumann architecture is not conducive to parallelism because of its reliance on a global program counter and global updatable memory. With DFP, each node uses local memory and executes when its inputs are ready.
DFP enables parallel execution without extra burden on the programmer. There's no need to deal with multiple threads, semaphores and deadlocks. Each node in the dataflow graph is independent of others and executes without side effects. A node executes as soon as its inputs are available. If multiple nodes are ready to execute, they can execute in parallel.
DFP has enabled many visual programming languages that provide a more user-friendly interface so that even non-technical users can write programs. Such languages are also suited for rapid prototyping.
DFP promotes loose coupling between software components that's well-suited for service-oriented architectures. Its functional style makes it easier to reason about system behaviour. Its simplicity is in its elegance.
DFP is deterministic. This enables mathematical analysis and proofs.
-
What are the main characteristics of dataflow programming? A program is defined by its nodes, I/O ports and links. A node can have multiple inputs and outputs. Links connect nodes. Links can be viewed as buffers if they need to hold multiple data packets. This is useful when sending and receiving nodes operate at different speeds. Buffers also enable pipelining whereby nodes can execute simultaneously.
A data packet can take primitive/compound, structured/unstructured values, and even include other data packets. When multiple links join, packets are merged or concatenated. For link splitting, packets are duplicated (shallow or deep copy). Equivalently, nodes can do the join and split.
Execution can be either a push or pull model. Push does eager evaluation and reacts to change. Pull does lazy evaluation and uses callbacks. Nodes in push model fire when inputs are available whereas nodes in pull model fire when output is requested.
Execution can be synchronous or asynchronous. With synchronous execution, each node take one "unit of time". Fast nodes will wait for slow nodes to complete before outputting the results. With asynchronous execution, outputs are never delayed and systems can run at full speed.
-
What are some traits expected of dataflow languages? Dataflow languages are expected to be free from side effects. There's locality of effect. Scheduling is determined from data dependencies. Variables may be seen as values since they're assigned once and never modified. Order of statements is not important.
An example of side effect is a function modifying variables in outer scope. Locality of effect means that instructions can have far-reaching data dependencies. This is easily possible in imperative languages by reusing the same variables or using
goto
statements. Neither of these should be allowed by dataflow languages. The language should be designed such that it's easy to identify data dependencies and generate graphs that expose parallelism.To address these, we can prohibit global variables and call-by-reference. Arrays are not modified but created anew when changes are made. Thus, arrays are treated as values rather than objects. Even errors are handled by error values instead of setting global status flags and interrupting the program. Since variables are really values, the statement
S=X+Y
is seen as a definition, not an assignment. In fact, languages that do all processing by applying operators to values have been called applicative languages. -
Could you list some dataflow languages? Dataflow programming is not limited to visual programming. Early languages were textual with no graphical representation. Visual languages enable end-user programming. Textual languages are faster to work with but need expert knowledge.
Textual languages of the early 1980s were VAL (MIT) and ID (Univ.Cal.Irvine). Lucid was developed for program verification but can also be used for dataflow computation. In the late 1980s, SISAL was invented. Based on VAL, it offered Pascal-like syntax and outperformed Fortran on some benchmarks. Other textual languages include CAL, COStream, Lustre, Joule, Oz, Swift (scripts), SIGNAL, SystemVerilog, and VHDL. The
make
software commonly used in UNIX systems is also an example of dataflow programming.Among DFVPLs are Keysight VEE, KNIME, LabVIEW, Microsoft Visual Programming Language, Orange, Prograph, Pure Data, Quartz Composer, and Simulink. More examples include Node-RED, Max MSP, and Lego Mindstorms.
Even spreadsheets are considered examples of visual DFP. Each cell is a node. When a cell value is updated, it sends its new value to other cells that depend on it.
-
What other programming paradigms are closely related to dataflow programming? Flow-Based Programming (FBP) is a subclass of DFP. While DFP can be synchronous or asynchronous, FBP is always asynchronous. FBP allows multiple input ports, has bounded buffers and applies back pressure when buffers fill up. Java, C#, C++, Lua and JavaScript support FBP. DrawFBP and Slang are visual languages for FBP.
Reactive Programming is event-driven, asynchronous and non-blocking. It's usually supported by callbacks and functional composition (declarative style of programming). Like DFP, focus is on the flow of data rather than the flow of control. Programming abstractions that support this include futures/promises, reactive streams, and dataflow variables. Akka Streams, Ratpack, Reactor, RxJava, and Vert.x are libraries for JVM that support reactive programming. In JavaScript, there's Bacon.js.
Like DFP, Functional Programming is free from side effects, mostly determinate and has locality of effect. However, it's possible to have a convoluted functional program that can't be implemented as a dataflow graph. It's been said that dataflow languages are essentially functional but use an imperative syntax.
Carl Hewitt's Actor Model from the 1970s is relevant to DFP. Actors are equivalent to nodes. Many languages such as Scala support the actor model.
-
What are some shortcomings or challenges with dataflow programming? Visual dataflow languages can be cumbersome for large programs with many nodes. Iterations and conditions are hard to represent visually. Debugging is difficult since multiple operations happen concurrently. In general, better tooling is needed.
Many DFVPLs aim to make programming easier for end users. The original motivation to exploit parallelism has become secondary.
DFP's deterministic nature can be a problem. Some use cases (such as ticket booking or database access) are inherently non-deterministic. To solve this, DFP must allow for non-deterministic merging of multiple input streams.
Spreadsheets, when seen via the dataflow paradigm, have their limitations. Formulas are hard to inspect. They're not suited for refactoring and reuse. Deployment and scalability are practical issues. However, solutions have been proposed to address all of these.
Among modern languages, Clojure, Groovy, Scala, Ruby, F# and C# support DFP but offer no syntactic support for dataflow variables. Even simple dataflow examples in F# and C# require lot of boilerplate code.
Milestones
The idea of dataflow networking can be traced to the work of John von Neumann and other researchers in the 1940s and 1950s in the context of "neuron net" model of finite state automata. In the early 1960s, Carl Adam Petri introduces the Petri Net model for concurrent systems. A dataflow network is a network of computers connected by communication links. Nodes are activated by the arrival of messages on their links.
Sutherland proposes a two-dimensional graphical method to define programs. He gives examples of written statements and their graphical equivalents. Where a symbol appears in two places, two equivalent connections appear in the graphical form. Intermediate values need not be named in the graph. Graphs are also explicit about multiple outputs. He uses the terms "flow input" and "flow output".
Kosinski publishes a paper titled A data flow language for operating systems programming. Programs use function composition. Primitive notion of iteration, recursion, conditionals, and data replication/aggregation/selection are employed. Researchers look to create dataflow languages since they find that imperative languages can't be compiled easily for dataflow hardware.
Davis creates the first visual dataflow language called Data-Driven Nets (DDNs). Dataflow graphs become programming aids rather than simply representations of textual programs. Practical visual programming languages appear only in the early 1980s with Graphical Programming Language (GPL) and Function Graph Language (FGL). However, their development is hampered by graphical limitations of current hardware.
Denning notes that dataflow languages are being developed at MIT, at IBM T.J. Watson Research Center, at the University of California at Irvine, and at IRIA Laboratory in France. A compiler would translate language syntax into a dataflow graph to be executed on an interpreter. When an operator is ready to run because its operands are available on the input links, the interpreter sends them as a packet to an extractor network, which routes the packet to a function unit. Results are fed back through an insertion network.
Though DFP was expected to eclipse von Neumann architecture, by mid-1980s it's clear that this hasn't happened. Researchers realize that this is probably because parallelism in dataflow architectures is too fine-grained. This starts a trend towards hybrid architectures and coarse-grained parallelism. Many dataflow instructions are grouped together and run in sequence, though obeying the rules of dataflow execution model.
Due to computers with better graphics, this decade sees the rise of Dataflow Visual Programming Languages (DFVPLs). LabView (1986) and NL (1993) are two well-known languages of the 1990s. The main motivation for DFVPLs is software engineering rather than parallelism. DFVPLs enable programmers to create programs by wiring dataflow graphs.
References
- Ackerman, W. 1982. "Data Flow Languages." IEEE Computer, vol. 15, no. 2, pp. 15-25. doi: 10.1109/MC.1982.1653938. Accessed 2020-11-15.
- Blackstock, Michael. 2016. "Node-RED: Lecture 5 – The Node-RED programming model." Node RED Programming Guide, May 6. Accessed 2020-11-15.
- Bonér, Jonas and Viktor Klang. 2016. "Reactive programming vs. Reactive systems." O'Reilly, December 2. Accessed 2020-11-15.
- Carkci, Matt. 2013. "Elements of Dataflow and Reactive Programming Systems." deepFriedCode, on YouTube, October 11. Accessed 2020-11-15.
- Denning, P.J. 1978. "Operating Systems Principles for Data Flow Networks." Computer, IEEE, vol. 11, pp. 86-96, July. doi: 10.1109/C-M.1978.218271. Accessed 2020-11-15.
- Dennis, Jack B. 1980. "Data Flow Supercomputers." IEEE Computer, vol. 13, no. 11, pp. 48-56, November. Accessed 2020-11-15.
- Herr, Stefan. 2012. "5 Major Problems of Spreadsheets Solved with Parallel Dataflow Programming." Blog, Ankhor, May 2. Accessed 2020-11-15.
- Hewitt, Carl, Peter Bishop, and Richard Steiger. 1973. "A Universal Modular ACTOR Formalism for Artificial Intelligence." IJCAI'73: Proceedings of the 3rd International Joint Conference on Artificial Intelligence, pp. 235-245, August. Accessed 2020-11-15.
- Johnston, Wesley M., J. R. Paul Hanna, and Richard J. Millar. 2004. "Advances in Dataflow Programming Languages." ACM Computing Surveys, vol. 36, no. 1, pp. 1-34, March. Accessed 2020-11-15.
- Kahn, G. 1974. "The Semantics of a Simple Language for Parallel Programming." In: Information Processing '74: Proceedings of the IFIP Congress, pp. 471-475. Accessed 2020-11-15.
- Kosinski, P.R. 1973. "A data flow language for operating systems programming." ACM SIGPLAN Notices, vol. 8, no. 9, January. Accessed 2020-11-15.
- Liptchinsky, Vitaliy. 2010. "Basics of Dataflow Programming in F# and C#." Code Project, September 3. Accessed 2020-11-15.
- Matschinske, Julian. 2018. "What the hell is dataflow programming?" Blog, BITSPARK, October 16. Accessed 2020-11-15.
- Morrison, J Paul. 2020a. "Flow-based Programming." Accessed 2020-11-15.
- Morrison, J Paul. 2020b. "Software on FBP Website." Accessed 2020-11-15.
- Powell, James. 2008. "An Introduction To Quartz Composer." Mac Tricks & Tips, March 22. Accessed 2020-11-15.
- Savas, Süleyman, Zain Ul-Abdin, and Tomas Nordström. 2018. "Designing Domain-Specific Heterogeneous Architectures from Dataflow Programs." MDPI, April 22. Accessed 2020-11-15.
- Sousa, Tiago Boldt. 2012. "Dataflow Programming: Concept, Languages and Applications." Doctoral Symposium on Informatics Engineering, University of Porto, January. Accessed 2020-11-15.
- Sutherland, William Robert. 1966. "The on-line graphical specification of computer procedures." PhD Thesis, Massachusetts Institute of Technology, January. Accessed 2020-11-15.
- Tcler's Wiki. 2017. "flow-based programming." Tcler's Wiki, March 3. Accessed 2020-11-15.
- Tcler's Wiki. 2020. "Dataflow programming." Tcler's Wiki, March 21. Accessed 2020-11-15.
- The Concord Consortium. 2018. "Computational Thinking in Biology: What is an InSPECT Dataflow Diagram?" Blog, The Concord Consortium, January 24. Accessed 2020-11-15.
- Wei, Haitao, Stéphane Zuckerman, Xiaoming Li, and Guang R.Gao. 2014. "A Dataflow Programming Language and its Compiler for Streaming Systems." Procedia Computer Science, vol. 29, 2014, pp. 1289-1298. Accessed 2020-11-15.
- Wikipedia. 2020. "Dataflow programming." Wikipedia, October 18. Accessed 2020-11-15.
Further Reading
- LabVIEW. 2010. "Data Flow Programming Basics." LabVIEW, on YouTube, July 30. Accessed 2020-11-15.
- Sousa, Tiago Boldt. 2012. "Dataflow Programming: Concept, Languages and Applications." Doctoral Symposium on Informatics Engineering, University of Porto, January. Accessed 2020-11-15.
- Johnston, Wesley M., J. R. Paul Hanna, and Richard J. Millar. 2004. "Advances in Dataflow Programming Languages." ACM Computing Surveys, vol. 36, no. 1, pp. 1-34, March. Accessed 2020-11-15.
- Humphrey, Marvin. 2019. "Dataflow Programming." Papers We Love, on YouTube, July 3. Accessed 2020-11-15.
- Morrison, J. Paul. 2016. "The origins of Flow Based Programming with J Paul Morrison." YouTube, October 15. Accessed 2020-11-15.