Long before computers and smartphones took over the world, almost all business ran on human-to-human communication, mostly spoken. When someone wanted pizza, they picked up the phone and called the pizza joint to order. The person on the other end took the order, prepared it and delivered it. The same applied to almost every business, from reserving a table at a restaurant to making an appointment at the hairdresser.
But as computers advanced, human-to-human communication gradually moved towards written forms such as email and messaging. This meant that people on both ends could get what they needed without relying on speech, which can be misinterpreted by either party because of differences in accents, dialects, etc.
Then came the era of full machine-to-machine communication. You book your ride through an app on a phone (a machine), which communicates via an API with another machine to get things done for you. No oral communication is required, and everything usually happens seamlessly. But not all businesses ended up exposing an API; many still rely on phone calls (oral communication) for most of their business. This is where Google Duplex comes in handy.
Google Duplex enables seamless communication between a machine and a human, with the machine initiating the call. Whatever accent or dialect the human uses, the machine is designed to interpret what is being said and respond effectively to get the task done. The human may not even realize that they are talking to a machine, because of added delays, pauses and filler words like "hmm" and "ah". This is where artificial intelligence comes in.
The component that decides what to say is a recurrent neural network (RNN), built using TensorFlow Extended and trained on many anonymized phone calls. Its input is the output of Google's Automatic Speech Recognition (ASR) software for the incoming audio, along with context such as the conversation's history, the parameters of the conversation (what, where, when, how many, etc.), and more.
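To make the inputs a bit more concrete, here is a minimal Python sketch of how an ASR transcript might be bundled with the conversation history and parameters before being handed to a decision model. This is not Google's code: `ConversationState`, `build_model_input` and the dictionary keys are all made-up names for illustration, and the real system's feature format is not public.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationState:
    """Hypothetical container for the context described above."""
    parameters: dict                              # what, where, when, how many, ...
    history: list = field(default_factory=list)   # (speaker, text) turns so far


def build_model_input(asr_transcript: str, state: ConversationState) -> dict:
    """Bundle the ASR transcript with the context a decision model would see."""
    model_input = {
        "utterance": asr_transcript.strip().lower(),
        "history": list(state.history),        # copy, so later turns do not alter it
        "parameters": dict(state.parameters),
    }
    # Record the human's turn so the next call sees it as history.
    state.history.append(("human", asr_transcript))
    return model_input


state = ConversationState(
    parameters={"what": "haircut", "when": "Tuesday 10am", "how_many": 1}
)
first = build_model_input("Hello, how can I help you?", state)
second = build_model_input("Sure, what time?", state)
```

After these two calls, `first["history"]` is empty while `second["history"]` holds the first human turn, which is the kind of running context the RNN is said to rely on.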
Finally, the speech is produced using Google's text-to-speech technologies, WaveNet and Tacotron. Filler words are inserted to make the machine's calls sound more natural and human-like to the person on the other end.
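As a rough illustration of the filler-word idea, the sketch below randomly prepends a filler to a response before it would be sent to a text-to-speech engine. The filler list, the `add_filler` function and the `rate` parameter are assumptions made for this example; how Duplex actually decides where to place fillers and pauses is not described here.

```python
import random

FILLERS = ["Hmm,", "Um,", "Ah,"]  # illustrative filler words, not Google's list


def add_filler(response: str, rate: float, rng: random.Random) -> str:
    """Prepend a filler word to the response with probability `rate`."""
    if rng.random() < rate:
        return f"{rng.choice(FILLERS)} {response}"
    return response


rng = random.Random(42)  # seeded so repeated runs behave the same way
plain = add_filler("I'd like to book a table for four.", rate=0.0, rng=rng)
filled = add_filler("I'd like to book a table for four.", rate=1.0, rng=rng)
```

With `rate=0.0` the text is returned unchanged, and with `rate=1.0` it always starts with one of the fillers. The resulting string is what would then be passed on to speech synthesis.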
This is of course a mammoth effort by Google engineers. But with the ever-increasing need for digitization in the coming decades, more and more businesses will offer ways to do business, make appointments, etc. through digital means (websites with forms, APIs, etc.). Google might not be able to harness the full potential of this technology if it focuses only on this type of machine-to-human use case. There may be other areas, like call centers that struggle to scale because of the human resource factor, which could benefit heavily from this technology by providing human-to-machine communication (initiated by the human) to solve issues.
Let's just wait and see. The future looks interesting.