Intro

Despite being a programmer by trade, I have loved tinkering with gadgets and electronics since I was a little kid. I had fun disassembling devices and, in some cases (more often than not, by accident), repairing them in the process. With that said, I am not an electronics engineer, but I have a knack for creating fun devices using the Arduino, ESP, and Raspberry Pi “platforms”. Such platforms let me combine three different hobbies of mine: physical hardware implementation, 3D design (with 3D printing), and of course, programming.

So when ChatGPT and the earlier OpenAI GPT models came out (AI models that can respond to written text in a strikingly realistic, human-like manner), I immediately thought that I should create a little AI “robot” voice assistant of my own. I know there are already many tutorials and articles out there with a similar goal in mind, but mine is likely different for the following reasons:

1. I was not using ChatGPT to assist me with programming it. I could have used it, and it probably would have made my iterations much quicker, but I resisted. More on why I did this deliberately later. I also did not use ChatGPT for writing this article, except where I explicitly state it. I used Grammarly to correct my small grammatical mistakes and typos.

2. My implementation is intended to be not just a bare-bones, rough-around-the-edges quick POC; I was trying to employ more sophisticated solutions so that, in theory, the device could be the basis for an actual commercial product. Examples:

  • I am using high-quality text-to-speech voices (Azure), which, although there is a free tier, is not a free API
  • Similarly, I am using Azure speech-to-text (voice recognition), because it’s quite quick and picks up my native language (Hungarian) very well
  • I have built a web admin UI for the base application so that one can customize the behavior and other settings of the voice assistant to a great extent (more on this also later)

3. Due to the above goal, I did quite a lot of research into off-the-shelf hardware that works reliably for such a voice assistant. I ended up using only commercially available components that can be purchased and assembled relatively easily.

4. As for the “body” of the voice assistant, I designed my own in Tinkercad and 3D printed it (having to do a few iterations). So this part of the bot is also custom-made and quite personal.

After the quite long introduction (sorry about that), let’s dig into the details and see how the AI voice assistant was made.

The voice assistant, in pieces:

The final “product”:

Since we are talking about AI and its recent developments anyway, let’s explore for a brief moment what other forms of generative AI exist today that can help create a new product. There has been a recent influx of image-generator AI models that can take a prompt (a natural-language description of how the image should look) and then generate sometimes frighteningly realistic images (you might even say photos). I provided a simple prompt to one of the most popular of such AIs, Midjourney (v5). It generated the following set of images. Not bad!

Initial goal

On a high level, I had the following goals:

  1. Create a device that I can talk to (it recognizes my speech) and which responds using TTS voices. The response should be based on a Large Language Model (LLM), like OpenAI’s ChatGPT or the GPT-3 model
  2. Leverage my existing skills and build this device using a Raspberry Pi (I initially thought about doing it on ESP32, but after some research, I realized that it would yield a suboptimal solution)
  3. Build a relatively nice case for it. I ended up creating a rabbit-looking “body” in Tinkercad
  4. Brush up my Python programming skills in the process, and program the “orchestration” in Python

Here is a quick demo of what this device can currently do (more details in later sections of this article):

Hardware

As mentioned before, my goal was to use off-the-shelf hardware. This, however, required quite a bit of experimentation with various options. I ended up with the following components:

Base device

I first evaluated the ESP32 but quickly realized that it would not be suitable (not fast and responsive enough). Then I did a POC using an Orange Pi Zero LTS 512MB, which worked quite well; however, I switched to another voice recognition solution (from Google to Azure), which did not work on the 32-bit ARM architecture of this Orange Pi board. So I finally ended up using a Raspberry Pi 4, which proved more than enough for the project. If I were to productize this device, the Raspberry Pi Compute Modules would likely be a cheaper and more compact alternative.

Sound input (microphone)

This was one of the most time-consuming things to figure out. I wanted to find the least complex solution, so I tried hard to get a low-level microphone connected to the Raspberry Pi. I found that there are a couple of I2S-interface mics available that could be made to work, e.g. ones based on the INMP441 chipset. These are very compact and are supposed to have high sensitivity and a high SNR (signal-to-noise ratio), so they looked great. I got them to work, but even after lengthy tweaking of various Linux sound system (ALSA) settings, I could never get this mic to pick up sounds reliably from further than 1–2 meters. Then I bought a very compact USB microphone, which barely sticks out of the USB port. Ultimately, this also proved to be a dead end: it had a lot of background noise (including static) and also only picked up sounds well when I was close to the device.
One night, I was busy trying to get the mic to work when I noticed the Logitech webcam on top of my monitor, and I gave it a try. Success! The recorded audio had very low background noise, yet it could pick up sounds even from across the room. I figured that these higher-end webcams must have some clever noise-canceling and audio-enhancement chips built into the hardware, which makes them superior. So I did some quick research and found that the Logitech C270 was relatively cheap (I could purchase a few used ones for even less), yet its mic audio quality was great! So I tossed the goal of making the mic as compact as possible and decided that the webcam approach was the way to go. I might revisit this later, but so far, I am happy with the results I am getting with the webcam.
I ended up using a Logitech webcam for the mic, as it proved to be far superior to the other alternatives (e.g. electret microphones and I2S solutions)

Sound output (speaker)

Here, too, I initially had a naive goal: surely, the speaker is something that can be kept as low-level as possible. The Raspberry Pi has a jack audio output, so all I would need is a small amplifier module (e.g. one based on the LM386 chip) plus a regular 1W / 8 Ohm small speaker, and I would be good to go. Here, I also failed. The audio was pretty loud, but it picked up quite a lot of static noise, even after trying various grounding points on the Raspberry Pi. Learning from my microphone endeavor, I settled on a cheap yet quite good-sounding USB speaker. I would have wished for something a bit more compact, but it still sounds pretty good and fits in the case nicely.

Screen

I wanted the voice assistant to have a “face”. Such a face could show a limited set of expressions (to make the bot more relatable), and the screen could also provide a way to show the “state” of the application: states like “listening”, “thinking”, and “speaking”. For this, the easiest way is to use an LCD screen that can communicate over the SPI protocol. There are quite a few such LCDs available, with different controllers. There’s a Python library (luma.lcd) that supports most of these screens, with various resolutions and sizes. The refresh rate is not that great with this library, because Python is not very fast for such applications, but it works quite well. I bought a screen on AliExpress with the ILI9341 controller, which has a resolution of 320×240.
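
To give a rough idea of how such a screen is driven from Python, here is a minimal illustrative sketch using luma.lcd (not my exact code; the SPI port/device and the DC/RST GPIO pins are assumptions that depend on the actual wiring):

# Minimal sketch: drive an ILI9341 SPI display with luma.lcd.
# The SPI port/device and the DC/RST GPIO pins are assumptions
# that depend on how the screen is wired to the Raspberry Pi.
from luma.core.interface.serial import spi
from luma.lcd.device import ili9341
from PIL import Image, ImageDraw

serial = spi(port=0, device=0, gpio_DC=24, gpio_RST=25)
device = ili9341(serial, width=320, height=240)

# Draw a simple "state" screen, e.g. while the assistant is listening.
image = Image.new("RGB", (device.width, device.height), "black")
draw = ImageDraw.Draw(image)
draw.text((10, 10), "Listening...", fill="white")
device.display(image)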

Battery

Finally, I also researched what it would take to have the device “untethered”, in the sense that it would not need to be plugged into a power outlet. At first, I thought I would buy a LiPo battery with a charger module, but then I got scared after reading about how such batteries can explode when shorted or charged the wrong way. 🙂 So, as with the previous components, I ended up using a readily available product: a small 10000mAh power bank (Hoco Q3). It’s quite compact yet powerful, so I could fit it into the bottom of the “body”. I have not yet done extensive measurements, but based on my calculations, it should power the voice assistant for at least a couple of hours (~5–6 or more). One important thing to look for when buying a power bank is that, according to the Raspberry Pi 4 minimum requirements, it should be able to supply at least 3A, although I can imagine that it would work with a less powerful power bank.
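
As a rough sanity check (the power-draw figures are my assumptions, not measurements): a 10000mAh power bank with 3.7V cells stores about 37Wh, of which perhaps ~30Wh is usable at 5V after conversion losses. If the Raspberry Pi 4 plus the screen, webcam, and speaker draw around 4–6W on average, that works out to roughly 5–7 hours of runtime, in line with the estimate above.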

Body

I have a 3D printer, so it was no question that I would design the body of the voice assistant myself and then 3D print it. I am not much of a 3D designer, but I managed to create a relatively nice-looking and compact enclosure in Tinkercad that fits all the components, though a bit snugly. The nice rounded ears I took from another 3D model, but the rest was done in Tinkercad. I went through a few iterations (this is where a 3D printer is really useful). Still, some of the parts and holes of the model were added retrospectively, e.g. with a regular drill. 🙂 Had I devoted a bit more time, I would have designed better “snapping” connectors between the parts and a better pushbutton compartment. I hid the microphone (the webcam electronics stripped out of their original housing) in the ears. It was not a perfect fit, so I had to fill in the hole with some additional pieces to make the webcam PCB stay in place. But overall, creating the body yourself is a pretty rewarding experience.

Software

As for the software, I wanted to build the “orchestration” of the various hardware components and the APIs used myself. I also had the goal of brushing up my Python programming with this exercise.
On a high level, the logic is pretty simple: initialize the SDKs (Azure and OpenAI) at the beginning, start continuous speech recognition, send the recognized speech of the user to an OpenAI GPT model, and finally “speak” the response using text-to-speech. In each “state”, output an image to the LCD screen. All this is done in Python using the relevant libraries:

  • azure.cognitiveservices.speech — for text to speech and speech to text (voice recognition)
  • openai — for the AI language model
  • luma.core and luma.lcd — for displaying images and text on the LCD screen
  • and a few others for GPIO handling (button press detection), YAML parsing (config file), and backoff (for API request retries)

Here is how it works (it might not be 100% precise UML, but I wanted to convey the main logic):

The code itself is not that nice: although I created standalone Python modules for the GPT processing, the LCD manipulation, and the configuration handling, it’s still a bit of spaghetti code. But if someone would like to take a look at it, I am happy to share it; just send me a message.
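
Still, to give a feel for the main flow, here is a heavily simplified, illustrative sketch of the orchestration (not the actual project code; the voice name, environment variable names, and the commented-out show_state() helper are placeholders, and it assumes the pre-1.0 openai Python SDK):

import os
import openai
import azure.cognitiveservices.speech as speechsdk

openai.api_key = os.environ["OPENAI_API_KEY"]

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"])
speech_config.speech_recognition_language = "hu-HU"
speech_config.speech_synthesis_voice_name = "hu-HU-NoemiNeural"

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# The chat history starts with the configurable initial ("system") prompt.
history = [{"role": "system", "content": "You are a helpful voice assistant."}]

def handle_utterance(text):
    # Note: doing the blocking GPT call inside the callback is a simplification.
    # show_state("thinking")  # placeholder: update the LCD "face"
    history.append({"role": "user", "content": text})
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=history)
    reply = response.choices[0].message["content"]
    history.append({"role": "assistant", "content": reply})
    # show_state("speaking")
    synthesizer.speak_text_async(reply).get()

recognizer.recognized.connect(
    lambda evt: handle_utterance(evt.result.text) if evt.result.text else None)
recognizer.start_continuous_recognition()
input("Press Enter to stop...\n")
recognizer.stop_continuous_recognition()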

Using ChatGPT and other OpenAI APIs

The Artificial Intelligence that makes this voice assistant so convincing is a so-called Large Language Model. The current leader in this field of AI is a company called OpenAI (in case you have been living under a rock and have not heard of it yet). I asked their flagship product, ChatGPT, to explain what it is exactly:

A large language model is a type of artificial intelligence (AI) model that is trained on vast amounts of textual data and can generate human-like language in response to a given prompt or input. Large language models have significantly improved the capabilities of natural language processing (NLP) and have been applied in various applications such as language translation, sentiment analysis, and chatbots.
OpenAI GPT models (Generative Pre-trained Transformer) are a family of large language models developed by OpenAI, a leading AI research laboratory. These models are trained using a technique called unsupervised learning, where the model learns from a vast amount of unstructured text data from the internet. The GPT models use a transformer architecture, which enables them to process and generate text in a context-aware and human-like way. GPT-3, the latest and largest version of the GPT model, has 175 billion parameters, making it one of the most powerful language models in existence. It can generate text that is difficult to distinguish from text written by a human and has the potential to revolutionize the field of NLP.

I already need to correct what the AI had to say about itself: it humbly said that it would “revolutionize the field of NLP”. That’s “a bit” of an understatement: such models will revolutionize everything, and not in 5–10 years, but in 1–2 years. More on my take on this at the end of this article. When I saw how good these models were, my first reaction was: “this will change the game”. (Actually, my first reaction was a little less profound, think “holy” and the s-word. :))

But what can such models do? Well, anything that you can do by manipulating language, but also much more. Anything that can be expressed in words might be a more precise description. So it can do software programming (write computer code) very effectively, and it can do much more with words and characters alone. It’s probably one of the most versatile tools currently available on the internet, despite being “text only”.

But let’s focus on what I have used for my voice assistant: quite simply, I “ask” it to respond to the user. OpenAI provides API access to its language models so that one can build a new application using the capabilities of ChatGPT and other models. When I started, the only suitable model was the text-davinci-003 model. It worked pretty well, but it was (and is) pretty expensive: https://openai.com/pricing#language-models. For this model, you pay $0.02 per 1000 tokens, which does not seem like much at first, but because of the way these models keep the context of the conversation, you essentially need to send the whole chat history with each API request. This means that if you continue the conversation for, say, an hour, this chat history can get pretty big and expensive (since both the chat history and the response of the API consume tokens). This context is required so that the voice assistant “remembers” earlier parts of the conversation, e.g. if you introduce yourself or tell it something personal that it does not “know” by itself. Of course, you can get away without this chat history, and you can also decide to prune or “compress” parts of the chat history to reduce the payload. A token, by the way, is not exactly a character; you can get a feel for how it works with this official calculator: https://platform.openai.com/tokenizer (an interesting caveat is that more exotic languages, e.g. Hungarian or Chinese, consume significantly more tokens per word on average).
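
To illustrate the pruning idea (a sketch with arbitrary numbers, not my production code), the tiktoken library can be used to estimate the size of the history and drop the oldest turns when it grows too large:

import tiktoken

encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(messages):
    # Rough estimate: ignores the small per-message overhead the API adds.
    return sum(len(encoder.encode(m["content"])) for m in messages)

def prune_history(messages, max_tokens=2000):
    # Drop the oldest user/assistant turns until the history fits the budget,
    # always keeping the initial "system" prompt at index 0.
    pruned = list(messages)
    while count_tokens(pruned) > max_tokens and len(pruned) > 2:
        del pruned[1]
    return pruned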

Since then, OpenAI has released the LLM that is the basis of the ChatGPT offering, the gpt-3.5-turbo model: https://platform.openai.com/docs/models/gpt-3-5. It provides more coherent responses (at least for such chat-like use cases) and costs 10 times less. With this new model, the whole voice assistant became much more affordable.

How to consume the models via the API is described well in the official OpenAI documentation: https://platform.openai.com/docs/guides/chat, so I will not describe it here in detail. I would rather focus on a few things I learned:

  • Response time and reliability of the API: one of the things that still makes my voice assistant feel lacking is the response time of the API calls. Sometimes it takes a few seconds (even 8–10 seconds) to generate a response, and the delay does not always correlate with the length of the prompt, so it’s quite unpredictable. Sometimes there are even downright outages when the API is down: https://status.openai.com/ (see the retry sketch at the end of this section)
  • Various models: I settled on the gpt-3.5-turbo AI model from OpenAI, since it’s the least expensive, and the quality of the response is quite good even in multiple languages, like German, Italian, Hungarian, etc…
  • Parameters you can tune: there’s just one parameter that is worth tweaking and experimenting with: the temperature. Quoting the official documentation on what this param does:
    “… between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.”
  • The new GPT-4 model: the most recent member of the GPT family is the GPT-4 model, which is said to be superior to GPT-3.5. But it’s also the most expensive one, which makes it cost-prohibitive to use for a voice assistant (for now, since we can expect the price to decrease as computation costs go down). I have not yet done extensive tests for the voice assistant use case with this model.
  • Consuming the API on Azure: Microsoft also hosts the OpenAI language models on Azure, and the price is the same. The problem, however, is that it’s currently in limited availability: only selected Azure customers can access it. But based on various reports, the response time of the Azure-hosted API endpoints is much better, so I hope I will be able to use it eventually.
  • Hallucinations: when building a commercial product around these large language model-based AIs, one of the biggest challenges will be that such models are inherently prone to hallucinations.

Hallucination in AI refers to the generation of outputs that may sound plausible but are either factually incorrect or unrelated to the given context. These outputs often emerge from the AI model’s inherent biases, lack of real-world understanding, or training data limitations. — https://bernardmarr.com/chatgpt-what-are-hallucinations-and-why-are-they-a-problem-for-ai-systems/

There’s also a lengthy Wikipedia article about what hallucinations are in the context of AI models, which is worth reading: https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)
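
Back to the reliability point from the list above: the backoff library mentioned in the software section can wrap the API call so that transient errors are retried automatically. Here is a small sketch of the idea (assuming the pre-1.0 openai SDK; the retry limit and temperature are arbitrary):

import backoff
import openai

@backoff.on_exception(
    backoff.expo,
    (openai.error.RateLimitError, openai.error.APIError, openai.error.Timeout),
    max_tries=5)
def ask_gpt(messages, temperature=0.8):
    # Retries with exponential backoff on transient API errors.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=temperature)
    return response.choices[0].message["content"]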

Azure TTS and STT APIs

Naturally, a voice-enabled AI assistant needs to reliably understand spoken language (speech-to-text, STT), and the response should be natural-sounding text-to-speech (TTS) synthesis. There are quite a few options, both “on the edge” (offline, processing entirely on the device) and in the cloud (using APIs). I had one special requirement that narrowed down the available options: I wanted the voice assistant to speak my own language, which is Hungarian. I did test a few offline packages that can do voice recognition and text-to-speech, but I was not convinced by the quality. So I started evaluating cloud providers, namely Google Cloud and Microsoft (Azure).

At first, I got the Google Cloud version up and running, and it worked pretty well. But then I checked the audio quality of the Azure voices, and they were far better, especially the Hungarian ones, which also came in not just one but two versions (a male and a female voice). So I switched to Azure.

There’s a pretty generous free tier, but after a few hours’ worth of TTS and STT, you need to pay: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services/

My experience with Azure’s voice recognition accuracy and the quality of the TTS voices has been pretty positive. You can even change the pitch and rate of the voice using SSML markup. Speech recognition works as “streaming recognition”, which means that recognition is happening continuously. The Speech SDK emits a “recognized” event once there’s a slight pause, which the SDK classifies as one “utterance”. So if you start speaking, even say a few sentences, and then pause (there’s silence), you get all the recognized sentences, up to a maximum of 15 seconds. This approach works well for a voice assistant use case.
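
For example, pitch and rate can be tweaked by wrapping the response text in SSML and using the SDK’s speak_ssml_async call; a small sketch (the voice name and prosody values are just examples, not my exact settings):

def speak_with_prosody(synthesizer, text, voice="hu-HU-NoemiNeural",
                       rate="+10%", pitch="+5%"):
    # Wrap the response text in SSML to adjust speaking rate and pitch.
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="hu-HU">'
        f'<voice name="{voice}"><prosody rate="{rate}" pitch="{pitch}">{text}</prosody></voice>'
        '</speak>')
    synthesizer.speak_ssml_async(ssml).get()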

There’s one thing that is a bit “hacky”: when the voice assistant speaks, I wanted to disable speech recognition. The most reliable way I found was to mute the microphone on the OS level for the duration of the speaking part.
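
One way to do this (a sketch of the approach; the card number and the “Mic” control name depend on the actual audio device and are assumptions here) is to shell out to ALSA’s amixer around the TTS call:

import subprocess
from contextlib import contextmanager

@contextmanager
def microphone_muted(card=1):
    # Mute the capture device while the assistant is speaking, then restore it.
    subprocess.run(["amixer", "-c", str(card), "set", "Mic", "nocap"], check=False)
    try:
        yield
    finally:
        subprocess.run(["amixer", "-c", str(card), "set", "Mic", "cap"], check=False)

# Usage: wrap the TTS call so the assistant does not hear itself.
# with microphone_muted():
#     synthesizer.speak_text_async(reply).get()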

For implementing the Python code, I used the available official guides:

Building the Web Admin UI

I was able to test and improve my voice assistant for a while, but then I got the idea that it should have a simple admin UI where I could rapidly change a few configuration settings (and test how the AI “behaves”):

  • the initial prompt of the GPT API call to significantly alter the “personality” of the bot
  • the used GPT model (ability to e.g. switch to the GPT4 model for a few tests)
  • maximum number of tokens (to prevent very long AI responses)
  • temperature parameter of the AI model (temperature defines the “creativity” of the responses, trading it for loss of consistency and “factuality”)
  • voice settings: the Azure voice used (with the ability to change it to a different language so that you can converse in that language), volume, pitch, and speaking rate
  • settings to switch on the display of the recognized text and AI-generated response directly on the LCD (mainly for debugging)

For this, I found a nice framework called nicegui. It was very easy to put together a config web UI in Python, which also acts as a web server. I also contemplated the tried-and-tested Flask framework, but I liked the fact that nicegui provides some ready-made UI elements and seemed easy to get up and running.
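
To give an idea of how little code this takes, here is a tiny illustrative sketch (not my actual admin UI; the config keys and defaults are made up) that exposes a few of the settings listed above and persists them to YAML:

import yaml
from nicegui import ui

CONFIG_PATH = "config.yaml"
config = {"initial_prompt": "", "model": "gpt-3.5-turbo", "max_tokens": 200, "temperature": 0.8}
with open(CONFIG_PATH) as f:
    config.update(yaml.safe_load(f) or {})

def save():
    with open(CONFIG_PATH, "w") as f:
        yaml.safe_dump(config, f)
    ui.notify("Configuration saved")

ui.textarea("Initial prompt").bind_value(config, "initial_prompt")
ui.select(["gpt-3.5-turbo", "gpt-4"], label="GPT model").bind_value(config, "model")
ui.number("Max tokens", min=16, max=1024).bind_value(config, "max_tokens")
ui.slider(min=0.0, max=2.0, step=0.1).bind_value(config, "temperature")
ui.button("Save config", on_click=save)

ui.run(port=8080)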

And sure enough, I got it working after a few hours of work. I settled on YAML for the config storage; there’s a good library available for Python. This is how the current web UI looks:

One limitation (more a design principle) of nicegui is that it does not provide user-specific sessions out of the box, which means that if you open the page in two different browsers, you will see that the session is shared. This is not much of a problem for such admin UIs, though. I would also need to add some security measures, e.g. at least simple password-based authentication.

Even before setting up the web admin UI, I came across a nice service called PiTunnel: https://www.pitunnel.com/.

It provides the following very handy features (among others):

  1. Even if you have a dynamic IP for the Raspberry Pi, and it’s connected to the internet via a regular router, it provides a web-based SSH terminal to the device from anywhere in the world. So you can easily SSH into your Raspberry Pi without having to set up port forwarding, etc. on the router to which your Pi is connected
  2. You can also create a custom tunnel to any port. This makes it possible to access the web server running on your Raspberry Pi at a fixed address (in the free version of PiTunnel, this address changes frequently, but you can always check the current one in your PiTunnel account; if you pay about $5 a month, you can get a permanent address for your Raspberry Pi)

Use cases for the voice assistant

There are many fun use cases where you can create a “personality” for such a voice assistant simply by crafting a so-called initial prompt. The initial prompt defines how the AI model should “behave” and respond to user requests. The possibilities here are almost limitless, except that there are deliberate safety measures (guardrails) implemented on top of the GPT models to prevent (or at least try to prevent) the misuse of these AI models, e.g. the spread of fake news, hate speech, and dangerous content (e.g. how to build a bomb). But even the innocent use cases are very fun! Here are just a few that I was testing:

Voice assistant acting as a sarcastic rabbit

This is the “personality” from the demo video posted at the beginning of the article. Here is the actual initial prompt for it, in case you are wondering:

You are Bun, a sarcastic bunny. You are 8 years old. You live in a forest. 
You reply in an offensive way. Reply always in short, few-sentence replies, unless necessary. Try to reply in short. Never admit that you are an AI or a chatbot. 
You reply always in an offensive, sarcastic way. You always tell me what you think.
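
For context, such an initial prompt is simply the first, “system” message of the chat history that is sent to the API with every request; roughly like this (an illustration, not my exact code):

messages = [
    {"role": "system", "content": "You are Bun, a sarcastic bunny. ..."},  # the initial prompt above
    {"role": "user", "content": "Hi Bun, how are you today?"},
]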

Virtual doctor recommending treatment

A much more interesting and serious possible use case is to use the voice assistant as a virtual doctor, who will analyze your symptoms and recommend treatment. Here is the prompt for this kind of “behavior”:

I want you to act as a virtual doctor. Your name is Brian. I will describe my symptoms and you will provide a diagnosis and treatment plan. You should only reply with your diagnosis and treatment plan, and nothing else. You are a voice-assisted virtual doctor, so the user is speaking instead of writing. Don’t respond by asking the user to write but tell them to speak instead.

Personal English Teacher

I also tried to make the voice assistant act like a language tutor:

I want you to act as a spoken English teacher and improver. Your name is Mary. I will speak to you in English and you will reply to me in English to practice my spoken English. I want you to keep your reply neat, limiting the reply to 100 words. I want you to strictly correct my grammar mistakes and factual errors. Don’t correct typos because I am not writing but speaking. I want you to ask me a question in your reply. Now let’s start practicing, you could ask me a question first. Remember, I want you to strictly correct my grammar mistakes.

Testing in the wild

What better way to test a prototype than testing it with children? Not just one, not two, but four children, all of them at the same time. This is precisely what I have done, on multiple occasions.

Key findings of the tests:

  1. All 4 kids were often speaking at the same time. Interestingly, most of the time the response of the AI was still pretty good and close to what was asked
  2. I have no idea why, but one of the most popular prompts for the voice assistant was to tell a joke, despite it (at least in Hungarian) telling quite bad jokes
  3. Children are not known for their utmost patience, so sometimes it was difficult for them to wait for the AI response (which — I suppose when the OpenAI servers were busy — took 5–10 seconds)
  4. I will need to implement a way to stop the response if the AI misunderstands the prompt, e.g. by saying: “stop speaking” or “that is not what I meant”, etc. This would require the program to keep listening for such “commands” while speaking, which is not that easy since the microphone would also pick up the sound of the voice assistant itself. So the code would need to “ignore” its own voice but listen for commands from users.
  5. But overall, they had a lot of fun speaking to the voice assistant on various subjects

Further ideas for improvement

I am quite happy with the current features of the voice assistant, and I learned a lot in the process. However, I do have a few ideas that could improve it further:

  • Hotword detection (~ “Hey Siri”): currently, the push button serves as a way to “pause” the voice assistant and prevent it from continuously listening and responding. However, voice assistants are usually activated using a hotword, e.g. “Hey Siri” or “OK Google”. For this, I would need an offline solution that does not require continuous API access for the recognition. I found a few Python packages for this, but I have yet to try them.
  • Eliminate response latency: to make the conversation flow really “fluid”, the latency of the response should decrease to about 1 second at most, I believe. This is currently not possible with the public OpenAI APIs. I assume part of the reason is that OpenAI cannot cope with the current demand, even though they have Microsoft backing them with infrastructure
  • Make use of the camera: provided I have a bit more time later, I should combine 2 different AIs into one: image “recognition” with large language models. This might be possible sooner rather than later with GPT-4, as it’s already multimodal (it can understand both text and images), although the multimodal mode is not yet released to the public.
  • More expressive facial emotions: I should make the “face” of the assistant more expressive: with animations and different kinds of expressions (GPT is also able to detect the “sentiment” of its own responses: to tell whether the response is a happy, a sad or e.g. a funny or angry kind of response).
  • Trying out other LLMs: OpenAI is currently the biggest, most advanced player in the field of large language models, but others (Google, Facebook, and Amazon) are catching up, and so are smaller players. I predict in a few months, high-quality competitor LLMs can be used for a fraction of the cost of the current cheapest OpenAI model.

What I learned in the process

Building and testing this voice assistant was (and is) my real, hands-on exposure to what these AI models are capable of. I haven’t played that much with them on the OpenAI chat website. Through prompting and instructing the model for the voice assistant, I can see what these models can do, and I can already foresee at least some of the dangers (see the next section for my take on these dangers and risks).

On the hardware side of things, I also learned a lot, and ultimately a big lesson was not to dismiss off-the-shelf, ready-to-use products out of “pride”. Pride in the sense that I spent many sleepless nights trying to get the low-level components to work reliably (e.g. the various kinds of microphones, and the speaker with an amplifier). But that in itself was also quite educational.

I could indeed improve my Python programming skills, so I think I could land a junior Python programmer job (if I wanted).

Finally, conducting tests of the voice assistant with children also provided a few lessons: after you work on a product for days and weeks, it’s very possible that the first testers will interact with it in a way that proves you still have a long way to go before you have a reliable and useful product.

My fears and hopes

  • My own profession is at risk: you have probably read about this in a lot of articles already: one of the jobs that many thought would be safe from AI for at least a decade is programming. Just a few years ago, experts thought that programming requires such creative and analytical thinking that it would be very difficult to replicate with an algorithm or AI model. Yet here we are: large language models, through learning the language of humanity from huge amounts of text, have also learned the language of programming quite effectively. The latest models like GPT-4 can already write program code at the level of a mid-level or sometimes even senior programmer. There’s still the need for someone to “orchestrate” the app-building process, but with a few prompts, even someone who has never programmed before can create actual working software. It can not only write the program itself but also instruct the user on how to set up the tools for creating a runnable program.
    What is my take on this?
    Well, I have a confession to make: I am scared and stubborn at the same time. You see, I am scared that if I try these “tools” for a longer time, with all different kinds of tasks, I myself will also have to confirm that they can do my job very well. On the other hand, I am being stubborn because I love programming. I don’t want the AI to take away the fun (and sometimes the struggle). And if the AI is doing all the work, how would I “exercise” my brain? So even though I know that ChatGPT and GPT-4 could do most of what I do as hobby projects, I won’t let them!
  • Spreading of misinformation / hallucinating bots: you can read about these language models hallucinating in many places. They can very convincingly state things that are complete rubbish (not grounded in truth and facts). I have made a few naive experiments myself: I instructed the language model to always say that the sky is red, no matter what. It worked. But this was just a very simple example. The way you can write news articles, comments, etc. using these language models will be much more subtle. You might not even notice. And there is a high risk that the internet will be flooded with such generated content. It’s already happening, I fear. Sure, fake news and misinformation have been a thing for years, but now anyone with very basic English skills can use these models to create very convincing misinformation campaigns, spam emails, posts, and the like…
  • Future of learning for my children: just by creating this voice assistant, I can see how valuable a similar device could be in education. It could explain complex topics effortlessly, and it won’t get mad if you ask it multiple times to explain something or to explain it in a different way. You can already think of using this technology as your own personal tutor. Yet, due to deliberate or unintentional mistakes of these models (hallucinations, for example), you can never be 100% sure that what you hear is actually true. There is a debate about whether hallucination can be completely eliminated from such models. Whatever the answer may be, education will also be disrupted significantly in the coming years.
  • Job displacement/unemployment: I bet that not only my fellow programmers will see their jobs disrupted, but there are actually quite a few professions already who might be displaced (fired, to put it bluntly) very soon: creative writers, journalists, graphic designers, copywriters, and many others. Choosing your education path in the coming years is ever more important.

Despite all of these (and many other dangers I have not listed), I am trying hard to stay hopeful. I sincerely believe that the rapid development and improvement of AI — with some luck — will benefit humanity, and it can bring a new, more positive equilibrium to the world.